---
pipeline_tag: text-to-video
library_name: mtvcraft
---
<h1 align='center' style='font-size: 3em;'>MTVCraft</h1>
<h2 align='center' style='font-size: 1.5em; color: grey; margin-top: 0; margin-bottom: 20px;'>An Open Veo3-style Audio-Video Generation Demo</h2>



<p align="center">
    <a href="https://github.com/baaivision/MTVCraft">
        <img src="https://img.shields.io/badge/Project%20Page-MTVCraft-yellow">
    </a>
    <a href="https://arxiv.org/pdf/2506.08003">
        <img src="https://img.shields.io/badge/arXiv%20paper-2506.08003-red">
    </a>
    <a href="https://huggingface.co/spaces/BAAI/MTVCraft">
        <img src="https://img.shields.io/badge/Online%20Demo-🤗-blue">
    </a>
    <!-- <br> 
    <a href="#pipeline">Pipeline</a> |
    <a href="#installation">Installation</a> |
    <a href="#download-pretrained-models">Models</a> |
    <a href="#run-inference">Inference</a> |
    <a href="#citation">Citation</a> -->
</p>


<table align='center' border="0" style="width: 100%; text-align: center;">
  <tr>
    <td align="center">
      <video controls width="60%">
        <source src="https://huggingface.co/BAAI/MTVCraft/resolve/main/video.mp4" type="video/mp4">
        Sorry, your browser does not support the video tag.
      </video>
      <em>For the best experience, please enable audio.</em>
    </td>
  </tr>
</table>



## 🎬 Pipeline

MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.

Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs [Qwen3](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3?modelGroup=qwen3) to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. Subsequently, these descriptions are fed into [ElevenLabs](https://elevenlabs.io/) to synthesize the corresponding audio tracks. Finally, the generated audio tracks serve as conditions that guide the [MTV framework](https://arxiv.org/pdf/2506.08003) in generating a video temporally synchronized with the sound.

Notably, both Qwen3 and ElevenLabs can be swapped for other services with similar capabilities.

<div align="center">
  <img src="https://huggingface.co/BAAI/MTVCraft/resolve/main/pipeline.png" alt="MTVCraft Pipeline" width="60%">
</div>
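
To make the first two stages concrete, here is a minimal, hypothetical sketch of prompt decomposition and audio synthesis. The `decompose_prompt` helper, its prompt template, the example prompt, and the voice ID are illustrative assumptions, not the repository's actual code (which lives in `mtv/utils.py`):

```python
# Hypothetical sketch of stages 1-2 (decomposition + audio synthesis).
# The real implementation lives in mtv/utils.py; names below are assumptions.
import json

from elevenlabs.client import ElevenLabs
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QWEN_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
elevenlabs = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")

def decompose_prompt(prompt: str) -> dict:
    """Ask Qwen3 to split a video prompt into the three audio descriptions."""
    response = client.chat.completions.create(
        model="qwen-plus",
        messages=[
            {"role": "system", "content": (
                "Split the user's video prompt into three audio descriptions. "
                'Answer with JSON only: {"speech": ..., "effects": ..., "music": ...}'
            )},
            {"role": "user", "content": prompt},
        ],
    )
    # Assumes the model follows the instruction and returns well-formed JSON.
    return json.loads(response.choices[0].message.content)

tracks = decompose_prompt("A chef narrates while chopping vegetables, soft jazz in the background")

# Synthesize each description; the SDK returns audio as an iterator of byte chunks.
speech = elevenlabs.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # placeholder voice ID
    text=tracks["speech"],
)
effects = elevenlabs.text_to_sound_effects.convert(text=tracks["effects"])

with open("speech.mp3", "wb") as f:
    for chunk in speech:
        f.write(chunk)
```

The resulting tracks then condition the MTV framework's video generation, which is what the checkpoints below provide.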

## ⚙️ Installation

For CUDA 12.1, you can install the dependencies with the following commands. For other CUDA versions, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`.

Download the code:

```bash
git clone https://github.com/suimuc/MTVCraft
cd MTVCraft
```

Create a conda environment:

```bash
conda create -n mtv python=3.10
conda activate mtv
```

Install the remaining packages with `pip`:

```bash
pip install -r requirements.txt
```

In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```

## 📥 Download Pretrained Models

You can get all pretrained models required for inference from our [HuggingFace repo](https://huggingface.co/BAAI/MTVCraft).

Use `huggingface-cli` to download the models:

```shell
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
```

Or you can download them separately from their source repos (a Python sketch follows the list):

- [mtv](https://huggingface.co/BAAI/MTVCraft/tree/main/mtv): our MTV checkpoints
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): the text encoder; you can download the [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer) folders from the CogVideoX-2b repo
- [vae](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT/tree/main/vae): the pretrained 3D VAE from CogVideoX-5B
- [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): the wav2vec2 speech feature extractor from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h)
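
Equivalently, here is a small Python sketch that fetches each component with `huggingface_hub.snapshot_download`. The `allow_patterns` filters are assumptions based on each repo's layout; you may still need to move files around to match the directory tree shown below:

```python
# Sketch: download each component with huggingface_hub (installed above).
from huggingface_hub import snapshot_download

# MTV checkpoints
snapshot_download("BAAI/MTVCraft", local_dir="./pretrained_models",
                  allow_patterns=["mtv/*"])

# T5 text encoder and tokenizer, taken from the CogVideoX-2b repo
snapshot_download("THUDM/CogVideoX-2b",
                  local_dir="./pretrained_models/t5-v1_1-xxl",
                  allow_patterns=["text_encoder/*", "tokenizer/*"])

# CogVideoX 3D VAE
snapshot_download("THUDM/CogVideoX1.5-5B-SAT", local_dir="./pretrained_models",
                  allow_patterns=["vae/*"])

# wav2vec2 audio encoder
snapshot_download("facebook/wav2vec2-base-960h",
                  local_dir="./pretrained_models/wav2vec2-base-960h")
```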

Finally, these pretrained models should be organized as follows:

```text
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|
|-- vae/
|   `-- 3d-vae.pt
|
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```

## 🎮 Run Inference

#### API Setup (Required)
Before running the inference script, make sure to configure your API keys in the file `mtv/utils.py`. Edit the following section:
```python
# mtv/utils.py

from elevenlabs.client import ElevenLabs  # imports shown for context; verify against the actual file
from openai import OpenAI

qwen_model_name = "qwen-plus"  # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY"  # replace with your actual Qwen API key

client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```

#### Batch

Once the API keys are set, you can run inference using the provided script:

```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```
This will read the input prompts from `./examples/samples.txt` and save the results to the given output directory (e.g., `./output`).
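
The exact format of `samples.txt` is not documented here; assuming one prompt per line, it might look like:

```text
A busker plays acoustic guitar on a rainy street corner, cars passing by
A blacksmith hammers glowing metal, sparks flying over a low ambient drone
```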

#### Gradio UI
To launch the Gradio UI, simply run:
```bash
bash scripts/app.sh output_dir
```


## 📝 Citation

If you find our work useful for your research, please consider citing the paper:

```bibtex
@article{MTV,
      title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
      author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
      journal={arXiv preprint arXiv:2506.08003},
      year={2025}
}
```