MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
Authors
Xinyang Li1,2,
Gen Li2,
Zhihui Lin1,3,
Yichen Qian1,3 †,
Gongxin Yao2,
Weinan Jia1,
Aowen Wang1,
Weihua Chen1,3,
Fan Wang1,3
1Xunguang Team, DAMO Academy, Alibaba Group
2Zhejiang University
3Hupan Lab
†Corresponding authors: [email protected], [email protected]
📂 Updates
⚙️ Installation
Create the conda environment and install the dependencies:

```bash
# 1. Create base environment
conda create -n moda python=3.10 -y
conda activate moda

# 2. Install requirements
pip install -r requirements.txt

# 3. Install ffmpeg
sudo apt-get update
sudo apt-get install ffmpeg -y
```
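If you want to sanity-check the setup before running inference, a quick check like the one below can help. This is a minimal sketch that only assumes the steps above completed successfully; adjust it to whatever requirements.txt actually installs.

```bash
# Optional environment sanity check (minimal sketch; assumes the steps above succeeded)
conda activate moda
python --version               # should report Python 3.10.x
ffmpeg -version | head -n 1    # confirms ffmpeg is on the PATH
pip check                      # flags any broken or missing dependency constraints
```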
🚀 Inference
Run inference on a reference image and a driving audio clip:

```bash
python src/models/inference/moda_test.py \
    --image_path src/examples/reference_images/6.jpg \
    --audio_path src/examples/driving_audios/5.wav
```
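To process several image/audio pairs in one run, a simple shell loop over the same script works. The sketch below only reuses the `--image_path` and `--audio_path` flags shown above; the specific file names are hypothetical and should be replaced with your own inputs.

```bash
# Batch inference sketch: pairs reference image i with driving audio i.
# File names are hypothetical; only the two flags shown above are reused.
for i in 1 2 3; do
    python src/models/inference/moda_test.py \
        --image_path "src/examples/reference_images/${i}.jpg" \
        --audio_path "src/examples/driving_audios/${i}.wav"
done
```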
⚖️ Disclaimer
This project is intended for academic research, and we explicitly disclaim any responsibility for user-generated content. Users are solely liable for their actions while using the generative model. The project contributors are not legally affiliated with, and bear no accountability for, users' behavior. It is imperative to use the generative model responsibly, adhering to both ethical and legal standards.
🙏🏻 Acknowledgements
We would like to thank the contributors to LivePortrait, EchoMimic, JoyVASA, Ditto, Open Facevid2vid, InsightFace, X-Pose, DiffPoseTalk, Hallo, wav2vec 2.0, Chinese Speech Pretrain, Q-Align, SyncNet, and VBench for their open research and extraordinary work. If we have missed any open-source projects or related articles, please let us know and we will add the acknowledgement promptly.
📑 Citation
If you use MoDA in your research, please cite:
```bibtex
@article{li2025moda,
  title={MoDA: Multi-modal Diffusion Architecture for Talking Head Generation},
  author={Li, Xinyang and Li, Gen and Lin, Zhihui and Qian, Yichen and Yao, Gongxin and Jia, Weinan and Chen, Weihua and Wang, Fan},
  journal={arXiv preprint arXiv:2507.03256},
  year={2025}
}
```