---
pipeline_tag: text-to-video
library_name: mtvcraft
---
<h1 align='center' style='font-size: 3em;'>MTVCraft</h1>
<h2 align='center' style='font-size: 1.5em; color: grey; margin-top: 0; margin-bottom: 20px;'>An Open Veo3-style Audio-Video Generation Demo</h2>
<p align="center">
<a href="https://github.com/baaivision/MTVCraft">
<img src="https://img.shields.io/badge/Project%20Page-MTVCraft-yellow">
</a>
<a href="https://arxiv.org/pdf/2506.08003">
<img src="https://img.shields.io/badge/arXiv%20paper-2506.08003-red">
</a>
<a href="https://huggingface.co/spaces/BAAI/MTVCraft">
<img src="https://img.shields.io/badge/Online%20Demo-🤗-blue">
</a>
</p>
<table align='center' border="0" style="width: 100%; text-align: center;">
<tr>
<td align="center">
<video controls width="60%">
<source src="https://huggingface.co/BAAI/MTVCraft/resolve/main/video.mp4" type="video/mp4">
Sorry, your browser does not support the video tag.
</video>
<em>For the best experience, please enable audio.</em>
</td>
</tr>
</table>
## 🎬 Pipeline
MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.

Specifically, the framework is a multi-stage pipeline. First, MTVCraft employs [Qwen3](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3?modelGroup=qwen3) to interpret the user's prompt, decomposing it into separate descriptions for three audio categories: human speech, sound effects, and background music. These descriptions are then fed into [ElevenLabs](https://elevenlabs.io/) to synthesize the corresponding audio tracks. Finally, the generated tracks serve as conditions that guide the [MTV framework](https://arxiv.org/pdf/2506.08003) to produce a video temporally synchronized with the sound.

Notably, both Qwen3 and ElevenLabs can be replaced by other services with similar capabilities.
<div align="center">
<img src="https://huggingface.co/BAAI/MTVCraft/resolve/main/pipeline.png" alt="MTVCraft Pipeline" width="60%">
</div>
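The three stages above can be sketched as follows. Everything here is illustrative: the `AudioPlan` structure and the function names are stand-ins for the actual MTVCraft components, with the model calls replaced by stubs.

```python
from dataclasses import dataclass


@dataclass
class AudioPlan:
    """Decomposed audio descriptions produced by the LLM stage."""
    speech: str
    sound_effects: str
    background_music: str


def decompose_prompt(prompt: str) -> AudioPlan:
    # Stage 1 (stub): an LLM such as Qwen3 splits the prompt
    # into the three audio categories.
    return AudioPlan(
        speech=f"Speech for: {prompt}",
        sound_effects=f"SFX for: {prompt}",
        background_music=f"Music for: {prompt}",
    )


def synthesize_audio(plan: AudioPlan) -> dict:
    # Stage 2 (stub): a synthesis service such as ElevenLabs renders
    # each description into a waveform.
    return {name: f"<waveform:{text}>" for name, text in vars(plan).items()}


def generate_video(prompt: str, tracks: dict) -> str:
    # Stage 3 (stub): the MTV model conditions on the audio tracks
    # to produce a temporally synchronized video.
    return f"<video conditioned on {len(tracks)} audio tracks>"


plan = decompose_prompt("A dog barks at a passing train")
tracks = synthesize_audio(plan)
print(generate_video("A dog barks at a passing train", tracks))
```

The point of the sketch is the interface between stages: as long as stage 1 yields one description per audio category, the LLM and TTS backends are swappable.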
## ⚙️ Installation
For CUDA 12.1, you can install the dependencies with the following commands. For other CUDA versions, you will need to install `torch`, `torchvision`, `torchaudio`, and `xformers` manually.

Clone the repository:
```bash
git clone https://github.com/suimuc/MTVCraft
cd MTVCraft
```
Create conda environment:
```bash
conda create -n mtv python=3.10
conda activate mtv
```
Install the packages with `pip`:
```bash
pip install -r requirements.txt
```
In addition, `ffmpeg` is required:
```bash
apt-get install ffmpeg
```
## 📥 Download Pretrained Models
You can get all pretrained models required for inference from our [Hugging Face repo](https://huggingface.co/BAAI/MTVCraft).

Use `huggingface-cli` to download the models:
```shell
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
```
Alternatively, you can download them separately from their source repos:
- [mtv](https://huggingface.co/BAAI/MTVCraft/tree/main/mtv): our checkpoints
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): text encoder; you can also download the [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer) folders from CogVideoX-2b
- [vae](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT/tree/main/vae): the pretrained 3D VAE from CogVideoX-5B
- [wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h): Facebook's wav2vec 2.0 audio feature extractor
Finally, these pretrained models should be organized as follows:
```text
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|-- vae/
|   `-- 3d-vae.pt
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```
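A quick way to confirm the download landed correctly is to check a few known paths. This helper is not part of the repo; the file list is a representative subset of the tree above:

```python
import os

# A subset of the expected files, relative to the pretrained_models root.
EXPECTED = [
    "mtv/single/latest",
    "mtv/multi/latest",
    "mtv/accm/latest",
    "t5-v1_1-xxl/config.json",
    "t5-v1_1-xxl/spiece.model",
    "vae/3d-vae.pt",
    "wav2vec2-base-960h/config.json",
    "wav2vec2-base-960h/vocab.json",
]


def missing_files(root: str) -> list:
    """Return the expected files that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]


missing = missing_files("./pretrained_models")
if missing:
    print("Missing files:", *missing, sep="\n  ")
else:
    print("All expected pretrained files found.")
```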
## 🎮 Run Inference
#### API Setup (Required)
Before running the inference script, make sure to configure your API keys in the file `mtv/utils.py`. Edit the following section:
```python
# mtv/utils.py
qwen_model_name = "qwen-plus" # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY" # replace with your actual Qwen API key
client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```
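If you swap in a different LLM provider, the only contract that matters is that its reply can be mapped to one description per audio category. The sketch below shows one way to validate such a reply; the JSON schema is an assumption for illustration, not MTVCraft's actual format:

```python
import json


def parse_audio_descriptions(llm_output: str) -> dict:
    """Parse an LLM reply into the three audio categories.

    Assumes the model was prompted to answer with a JSON object of the form
    {"speech": ..., "sound_effects": ..., "background_music": ...} -- this
    schema is illustrative, not MTVCraft's actual prompt format.
    """
    data = json.loads(llm_output)
    categories = ("speech", "sound_effects", "background_music")
    missing = [c for c in categories if c not in data]
    if missing:
        raise ValueError(f"LLM reply is missing categories: {missing}")
    return {c: data[c] for c in categories}


reply = '{"speech": "a narrator whispers", "sound_effects": "rain on glass", "background_music": "slow piano"}'
print(parse_audio_descriptions(reply)["speech"])  # a narrator whispers
```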
#### Batch
Once the API keys are set, you can run inference using the provided script:
```bash
bash scripts/inference_long.sh ./examples/samples.txt output_dir
```
This reads the input prompts from `./examples/samples.txt` and saves the results to the specified output directory.
#### Gradio UI
To launch the Gradio UI, simply run:
```bash
bash scripts/app.sh output_dir
```
## 📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@article{MTV,
title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
journal={arXiv preprint arXiv:2506.08003},
year={2025}
}
``` |