MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
Authors
Xinyang Li1,2,
Gen Li2,
Zhihui Lin1,3,
Yichen Qian1,3 †,
Gongxin Yao2,
Weinan Jia1,
Aowen Wang1,
Weihua Chen1,3,
Fan Wang1,3
1Xunguang Team, DAMO Academy, Alibaba Group
2Zhejiang University
3Hupan Lab
†Corresponding authors: [email protected], [email protected]
📂 Updates
⚙️ Installation
Create the conda environment and install the dependencies:

```bash
# 1. Create base environment
conda create -n moda python=3.10 -y
conda activate moda

# 2. Install requirements
pip install -r requirements.txt

# 3. Install ffmpeg
sudo apt-get update
sudo apt-get install ffmpeg -y
```
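If you want to sanity-check the setup before running inference, a quick check like the one below can help. This is a minimal sketch that only assumes the steps above completed successfully; adjust it to whatever requirements.txt actually installs.

```bash
# Optional environment sanity check (minimal sketch; assumes the steps above succeeded)
conda activate moda
python --version               # should report Python 3.10.x
ffmpeg -version | head -n 1    # confirms ffmpeg is on the PATH
pip check                      # flags any broken or missing dependency constraints
```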
🚀 Inference
Run inference on a reference image and a driving audio clip:

```bash
python src/models/inference/moda_test.py \
    --image_path src/examples/reference_images/6.jpg \
    --audio_path src/examples/driving_audios/5.wav
```
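To process several image/audio pairs in one run, a simple shell loop over the same script works. The sketch below only reuses the `--image_path` and `--audio_path` flags shown above; the specific file names are hypothetical and should be replaced with your own inputs.

```bash
# Batch inference sketch: pairs reference image i with driving audio i.
# File names are hypothetical; only the two flags shown above are reused.
for i in 1 2 3; do
    python src/models/inference/moda_test.py \
        --image_path "src/examples/reference_images/${i}.jpg" \
        --audio_path "src/examples/driving_audios/${i}.wav"
done
```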
⚖️ Disclaimer
This project is intended for academic research, and we explicitly disclaim any responsibility for user-generated content. Users are solely liable for their actions while using the generative model. The project contributors are not legally affiliated with, and bear no accountability for, users' behavior. It is imperative to use the generative model responsibly, adhering to both ethical and legal standards.
🙏🏻 Acknowledgements
We would like to thank the contributors to LivePortrait, EchoMimic, JoyVASA, Ditto, Open Facevid2vid, InsightFace, X-Pose, DiffPoseTalk, Hallo, wav2vec 2.0, Chinese Speech Pretrain, Q-Align, SyncNet, and VBench for their open research and extraordinary work. If we have missed any open-source projects or related articles, please let us know and we will add the acknowledgement promptly.
📑 Citation
If you use MoDA in your research, please cite:
```bibtex
@article{li2025moda,
  title={MoDA: Multi-modal Diffusion Architecture for Talking Head Generation},
  author={Li, Xinyang and Li, Gen and Lin, Zhihui and Qian, Yichen and Yao, Gongxin and Jia, Weinan and Chen, Weihua and Wang, Fan},
  journal={arXiv preprint arXiv:2507.03256},
  year={2025}
}
```