---
pipeline_tag: text-to-video
library_name: mtvcraft
license: apache-2.0
---

<h1 align='center' style='font-size: 3em;'>MTVCraft</h1>

<h2 align='center' style='font-size: 1.5em; color: grey; margin-top: 0; margin-bottom: 20px;'>An Open Veo3-style Audio-Video Generation Demo</h2>
<p align="center">
  <a href="https://github.com/baaivision/MTVCraft">
    <img src="https://img.shields.io/badge/Project%20Page-MTVCraft-yellow">
  </a>
  <a href="https://arxiv.org/pdf/2506.08003">
    <img src="https://img.shields.io/badge/arXiv%20paper-2506.08003-red">
  </a>
  <a href="https://huggingface.co/spaces/BAAI/MTVCraft">
    <img src="https://img.shields.io/badge/Online%20Demo-🤗-blue">
  </a>
</p>

<table align='center' border="0" style="width: 100%; text-align: center;">
  <tr>
    <td align="center">
      <video controls width="60%">
        <source src="https://huggingface.co/BAAI/MTVCraft/resolve/main/video.mp4" type="video/mp4">
        Sorry, your browser does not support the video tag.
      </video>
      <em>For the best experience, please enable audio.</em>
    </td>
  </tr>
</table>
## 🎬 Pipeline

MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content.

Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs [Qwen3](https://bailian.console.aliyun.com/?tab=model#/model-market/detail/qwen3?modelGroup=qwen3) to interpret the user's prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. These descriptions are then fed into [ElevenLabs](https://elevenlabs.io/) to synthesize the corresponding audio tracks. Finally, the generated tracks serve as conditions that guide the [MTV framework](https://arxiv.org/pdf/2506.08003) in generating a video temporally synchronized with the sound.

Notably, both Qwen3 and ElevenLabs can be replaced by other models or services with similar capabilities.

<div align="center">
  <img src="https://huggingface.co/BAAI/MTVCraft/resolve/main/pipeline.png" alt="MTVCraft Pipeline" width="60%">
</div>
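To make the data flow concrete, below is a minimal sketch of the three stages in Python. The helper functions and the example prompt are hypothetical stand-ins introduced for illustration only, not the repository's actual API; the real orchestration lives in the inference scripts described below.

```python
# A minimal sketch of the three-stage flow. All helpers are stubs that
# only illustrate what each stage consumes and produces.

def interpret_prompt(prompt: str) -> dict:
    """Stage 1 (Qwen3): deconstruct the prompt into three audio descriptions."""
    return {"speech": "...", "effects": "...", "music": "..."}

def synthesize_audio(descriptions: dict) -> dict:
    """Stage 2 (ElevenLabs): synthesize one audio track per description."""
    return {name: b"" for name in descriptions}  # placeholder: raw audio bytes per track

def generate_video(prompt: str, tracks: dict) -> bytes:
    """Stage 3 (MTV): render a video temporally synchronized with the tracks."""
    return b""

prompt = "A chef chops vegetables while humming a tune"
tracks = synthesize_audio(interpret_prompt(prompt))
video = generate_video(prompt, tracks)
```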
## ⚙️ Installation

For CUDA 12.1, you can install the dependencies with the following commands. For other environments, you will need to manually install compatible builds of `torch`, `torchvision`, `torchaudio`, and `xformers`.

Download the code:

```bash
git clone https://github.com/baaivision/MTVCraft
cd MTVCraft
```

Create a conda environment:

```bash
conda create -n mtv python=3.10
conda activate mtv
```

Install the required packages with `pip`:

```bash
pip install -r requirements.txt
```

In addition, `ffmpeg` is required:

```bash
apt-get install ffmpeg
```
## 📥 Download Pretrained Models

You can easily get all pretrained models required for inference from our [HuggingFace repo](https://huggingface.co/BAAI/MTVCraft).

Use `huggingface-cli` to download the models:

```bash
cd $ProjectRootDir
pip install "huggingface_hub[cli]"
huggingface-cli download BAAI/MTVCraft --local-dir ./pretrained_models
```

Or you can download them separately from their source repos:

- [mtv](https://huggingface.co/BAAI/MTVCraft/tree/main/mtv): our checkpoints
- [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl): the text encoder; you can download it from the CogVideoX-2b [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder) and [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer) folders
- [vae](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT/tree/main/vae): the pretrained 3D VAE from CogVideoX-5B
- [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): the wav2vec 2.0 speech feature extractor from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h)

Finally, the pretrained models should be organized as follows:
```text
./pretrained_models/
|-- mtv/
|   |-- single/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   |-- multi/
|   |   |-- 1/
|   |   |   `-- mp_rank_00_model_states.pt
|   |   `-- latest
|   `-- accm/
|       |-- 1/
|       |   `-- mp_rank_00_model_states.pt
|       `-- latest
|-- t5-v1_1-xxl/
|   |-- config.json
|   |-- model-00001-of-00002.safetensors
|   |-- model-00002-of-00002.safetensors
|   |-- model.safetensors.index.json
|   |-- special_tokens_map.json
|   |-- spiece.model
|   `-- tokenizer_config.json
|-- vae/
|   `-- 3d-vae.pt
`-- wav2vec2-base-960h/
    |-- config.json
    |-- feature_extractor_config.json
    |-- model.safetensors
    |-- preprocessor_config.json
    |-- special_tokens_map.json
    |-- tokenizer_config.json
    `-- vocab.json
```
## 🎮 Run Inference

#### API Setup (Required)

Before running the inference script, make sure to configure your API keys in `mtv/utils.py`. Edit the following section:

```python
# mtv/utils.py

qwen_model_name = "qwen-plus"       # or another model name you prefer
qwen_api_key = "YOUR_QWEN_API_KEY"  # replace with your actual Qwen API key

client = OpenAI(
    api_key=qwen_api_key,
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

elevenlabs = ElevenLabs(
    api_key="YOUR_ELEVENLABS_API_KEY",  # replace with your actual ElevenLabs API key
)
```
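Before launching a full run, it can be worth sanity-checking the Qwen key with a one-off request. This is an optional check rather than part of the repository's scripts; it assumes the `client` and `qwen_model_name` defined above:

```python
# Optional: verify the Qwen API key with a minimal request.
# Assumes the `client` and `qwen_model_name` from mtv/utils.py above;
# an authentication error here means inference will fail as well.
reply = client.chat.completions.create(
    model=qwen_model_name,
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)
```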
#### Batch

Once the API keys are set, you can run inference with the provided script:

```bash
bash scripts/inference_long.sh ./examples/samples.txt ./output
```

This reads the input prompts from `./examples/samples.txt` and saves the results to `./output`.
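The exact format of the prompt file is defined by the examples shipped in the repository; as a purely illustrative assumption, a file with one scene prompt per line would look like:

```text
A street performer plays guitar while a crowd claps along
A thunderstorm rolls over a city as rain hits the windows
```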
#### Gradio UI

To launch the Gradio UI, simply run:

```bash
bash scripts/app.sh output_dir
```
## 📝 Citation

If you find our work useful for your research, please consider citing the paper:

```
@article{MTV,
  title={Audio-Sync Video Generation with Multi-Stream Temporal Control},
  author={Weng, Shuchen and Zheng, Haojie and Chang, Zheng and Li, Si and Shi, Boxin and Wang, Xinlong},
  journal={arXiv preprint arXiv:2506.08003},
  year={2025}
}
```