Update README.md
---
license: mit
---
# StableAvatar

<a href='https://francis-rings.github.io/StableAvatar'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2508.08248'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/StableAvatar/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=6lhvmbzvv3Y'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
<br/>
*Shuyuan Tu<sup>1</sup>, Yueming Pan<sup>3</sup>, Yinming Huang<sup>1</sup>, Xintong Han<sup>4</sup>, Zhen Xing<sup>1</sup>, Qi Dai<sup>2</sup>, Chong Luo<sup>2</sup>, Zuxuan Wu<sup>1</sup>, Yu-Gang Jiang<sup>1</sup>
<br/>
[<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Xi'an Jiaotong University; <sup>4</sup>Hunyuan, Tencent Inc]

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
<tr>
<td>
<video src="https://github.com/user-attachments/assets/eac3ec34-1999-4a41-81fc-5f0a296a44b5" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/b5902ac4-8188-4da8-b9e6-6df280690ed1" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/87faa5c1-a118-4a03-a071-45f18e87e6a0" width="320" controls loop></video>
</td>
</tr>
<tr>
<td>
<video src="https://github.com/user-attachments/assets/531eb413-8993-4f8f-9804-e3c5ec5794d4" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/cdc603e2-df46-4cf8-a14e-1575053f996f" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/7022dc93-f705-46e5-b8fc-3a3fb755795c" width="320" controls loop></video>
</td>
</tr>
<tr>
<td>
<video src="https://github.com/user-attachments/assets/0ba059eb-ff6f-4d94-80e6-f758c613b737" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/03e6c1df-85c6-448d-b40d-aacb8add4e45" width="320" controls loop></video>
</td>
<td>
<video src="https://github.com/user-attachments/assets/90b78154-dda0-4eaa-91fd-b5485b718a7f" width="320" controls loop></video>
</td>
</tr>
</table>

<p style="text-align: justify;">
<span>Audio-driven avatar videos generated by StableAvatar, showing its power to synthesize <b>infinite-length</b> and <b>ID-preserving videos</b>. All videos are <b>directly synthesized by StableAvatar without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
</p>

<p align="center">
<video src="https://github.com/user-attachments/assets/90691318-311e-40b9-9bd9-62db83ab1492" width="768" autoplay loop muted playsinline></video>
<br/>
<span>Comparison results between StableAvatar and state-of-the-art (SOTA) audio-driven avatar video generation models highlight the superior performance of StableAvatar in delivering <b>infinite-length, high-fidelity, identity-preserving avatar animation</b>.</span>
</p>


## Overview

<p align="center">
<img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
<br/>
<i>Overview of the StableAvatar framework.</i>
</p>

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation.
We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Since current diffusion backbones lack audio-related priors, this approach causes severe latent distribution error accumulation across video clips, so the latent distributions of subsequent segments gradually drift away from the optimal distribution.
To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
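As a rough illustration of the sliding-window fusion idea, the sketch below blends the overlapping latent frames of consecutive context windows with linearly ramped weights. It is an illustrative assumption only (the tensor layout, the linear ramp, and the `fuse_windows` helper are not taken from the StableAvatar code):

```python
import torch

def fuse_windows(windows, overlap):
    """Blend latents from consecutive context windows that share `overlap` frames.

    windows: list of tensors shaped [frames, channels, height, width].
    Overlapping frames are mixed with linearly ramped weights so that each
    window hands over smoothly to the next one.
    """
    ramp = torch.linspace(0.0, 1.0, steps=overlap).view(-1, 1, 1, 1)  # 0 -> 1 across the overlap
    fused = windows[0]
    for nxt in windows[1:]:
        blended = (1.0 - ramp) * fused[-overlap:] + ramp * nxt[:overlap]
        fused = torch.cat([fused[:-overlap], blended, nxt[overlap:]], dim=0)
    return fused
```

In this reading, the `--overlap_window_length` argument described in the inference section controls how many frames fall into the blended region.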
## News
* `[2025-8-11]`:🔥 The project page, code, technical report and [a basic model checkpoint](https://huggingface.co/FrancisRing/StableAvatar/tree/main) are released. The LoRA training code, the evaluation dataset, and StableAvatar-pro will be released very soon. Stay tuned!

## 🛠️ To-Do List
- [x] StableAvatar-1.3B-basic
- [x] Inference Code
- [x] Data Pre-Processing Code (Audio Extraction)
- [x] Data Pre-Processing Code (Vocal Separation)
- [x] Training Code
- [ ] LoRA Training Code (Before 2025.8.17)
- [ ] LoRA Finetuning Code (Before 2025.8.17)
- [ ] Full Finetuning Code (Before 2025.8.17)
- [ ] Inference Code with Audio Native Guidance
- [ ] StableAvatar-pro

## 🔑 Quickstart

The basic model checkpoint (Wan2.1-1.3B-based) supports generating <b>infinite-length videos at 480x832, 832x480, or 512x512 resolution</b>. If you run into insufficient GPU memory, you can reduce the number of animated frames or the output resolution accordingly.

### 🧱 Environment setup

```
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Optionally install flash_attn to accelerate attention computation
pip install flash_attn
```
### 🧱 Download weights
If you encounter connection issues with Hugging Face, you can use the mirror endpoint by setting the environment variable `export HF_ENDPOINT=https://hf-mirror.com`.
Please download the weights manually as follows:
```
pip install "huggingface_hub[cli]"
cd StableAvatar
mkdir checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
```
All the downloaded weights should be organized in `checkpoints` as shown below; the overall file structure of this project is as follows:
```
StableAvatar/
├── accelerate_config
├── deepspeed_config
├── examples
├── wan
├── checkpoints
│ ├── Kim_Vocal_2.onnx
│ ├── wav2vec2-base-960h
│ ├── Wan2.1-Fun-V1.1-1.3B-InP
│ └── StableAvatar-1.3B
├── inference.py
├── inference.sh
├── train_1B_square.py
├── train_1B_square.sh
├── train_1B_vec_rec.py
├── train_1B_vec_rec.sh
├── audio_extractor.py
├── vocal_seperator.py
├── requirements.txt
```

### 🧱 Audio Extraction
Given a target video file (.mp4), you can use the following command to obtain the corresponding audio file (.wav):
```
python audio_extractor.py --video_path="path/test/video.mp4" --saved_audio_path="path/test/audio.wav"
```
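When preparing many clips (for example for a training dataset), the extractor can simply be called in a loop. The helper below is a convenience sketch that only assumes the `audio_extractor.py` command-line interface shown above; `extract_all_audio` itself is not part of the repository:

```python
import glob
import os
import subprocess

def extract_all_audio(video_dir):
    """Run audio_extractor.py for every .mp4 in video_dir, writing a .wav next to it."""
    for video in sorted(glob.glob(os.path.join(video_dir, "*.mp4"))):
        audio = os.path.splitext(video)[0] + ".wav"
        subprocess.run(
            ["python", "audio_extractor.py",
             f"--video_path={video}", f"--saved_audio_path={audio}"],
            check=True,
        )
```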
### 🧱 Vocal Separation
As noisy background music may negatively impact the performance of StableAvatar to some extent, you can further separate the vocals from the audio file for better lip synchronization.
Given the path to an audio file (.wav), you can run the following command to extract the corresponding vocal signals:
```
pip install audio-separator
python vocal_seperator.py --audio_separator_model_file="path/StableAvatar/checkpoints/Kim_Vocal_2.onnx" --audio_file_path="path/test/audio.wav" --saved_vocal_path="path/test/vocal.wav"
```

### 🧱 Base Model Inference
A sample configuration for testing is provided in `inference.sh`. You can modify the various configuration options according to your needs.

```
bash inference.sh
```
Wan2.1-1.3B-based StableAvatar supports audio-driven avatar video generation at three resolution settings: 512x512, 480x832, and 832x480. You can modify `--width` and `--height` in `inference.sh` to set the resolution of the animation. `--output_dir` in `inference.sh` is the path where the generated animation is saved. `--validation_reference_path`, `--validation_driven_audio_path`, and `--validation_prompts` in `inference.sh` refer to the path of the reference image, the path of the driving audio, and the text prompt, respectively.
Prompts also matter. The recommended format is `[Description of first frame]-[Description of human behavior]-[Description of background (optional)]`, for example: "A woman stands in a bright studio, smiling and speaking to the camera, with a plain gray backdrop."
`--pretrained_model_name_or_path`, `--pretrained_wav2vec_path`, and `--transformer_path` in `inference.sh` are the paths of the pretrained Wan2.1-1.3B weights, the pretrained Wav2Vec2.0 weights, and the pretrained StableAvatar weights, respectively.
`--sample_steps`, `--overlap_window_length`, and `--clip_sample_n_frames` refer to the total number of inference steps, the overlapping context length between two context windows, and the number of frames synthesized per batch/context window, respectively.
Notably, the recommended `--sample_steps` range is [30-50]; more steps bring higher quality. The recommended `--overlap_window_length` range is [5-15]; a longer overlap yields higher quality but slower inference.
`--sample_text_guide_scale` and `--sample_audio_guide_scale` are the Classifier-Free Guidance (CFG) scales for the text prompt and the audio. The recommended range for both is [3-6]. You can increase the audio CFG scale to improve lip synchronization with the audio.
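The two scales can be read as weights on separate guidance terms. The snippet below is a generic dual classifier-free guidance sketch, shown only for intuition; it is an assumed formulation, not code taken from `inference.py`:

```python
def dual_cfg(eps_uncond, eps_text, eps_text_audio, s_text, s_audio):
    """Combine text and audio guidance on top of the unconditional prediction.

    eps_* are denoiser noise predictions under different conditioning;
    s_text and s_audio play the role of --sample_text_guide_scale and
    --sample_audio_guide_scale in this sketch.
    """
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_audio * (eps_text_audio - eps_text))
```

In this sketch, raising `s_audio` pulls the prediction further toward the audio-conditioned branch, which is consistent with the tip above about strengthening lip synchronization.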
We provide 6 cases in different resolution settings in `path/StableAvatar/examples` for validation. ❤️❤️Please feel free to try them out and enjoy the endless entertainment of infinite-length avatar video generation❤️❤️!

#### 💡 Tips
- Wan2.1-1.3B-based StableAvatar weights come in two versions, `transformer3d-square.pt` and `transformer3d-rec-vec.pt`, which are trained on two video datasets with two different resolution settings. Both versions support generating audio-driven avatar videos at three resolution settings: 512x512, 480x832, and 832x480. You can modify `--transformer_path` in `inference.sh` to switch between them.

- If you have limited GPU resources, you can change the loading mode of StableAvatar by modifying `--GPU_memory_mode` in `inference.sh`. The options are `model_full_load`, `sequential_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `model_cpu_offload`. In particular, when you set `--GPU_memory_mode` to `sequential_cpu_offload`, the total GPU memory consumption drops to approximately 3GB, at the cost of slower inference.
Setting `--GPU_memory_mode` to `model_cpu_offload` can also significantly cut GPU memory usage, reducing it by roughly half compared to `model_full_load` mode.

- If you have multiple GPUs, you can speed up inference with multi-GPU parallelism by modifying `--ulysses_degree` and `--ring_degree` in `inference.sh`. For example, if you have 8 GPUs, you can set `--ulysses_degree=4` and `--ring_degree=2`. Note that ulysses_degree x ring_degree must equal the total GPU number (world size). You can also add `--fsdp_dit` in `inference.sh` to activate FSDP in the DiT and further reduce GPU memory consumption.

Videos synthesized by StableAvatar do not contain audio. To obtain a high-quality MP4 file with audio, we recommend running ffmpeg on the <b>output_path</b> as follows:
```
ffmpeg -i video_without_audio.mp4 -i /path/audio.wav -c:v copy -c:a aac -shortest /path/output_with_audio.mp4
```

### 🧱 Model Training
<b>🔥🔥It is worth noting that if you are looking to train a conditioned Video Diffusion Transformer (DiT) model, such as Wan2.1, this training tutorial will also be helpful.🔥🔥</b>
The training dataset has to be organized as follows:

```
talking_face_data/
├── rec
│ │ ├──speech
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ │ ├──frame_0.png
│ │ │ │ │ ├──frame_1.png
│ │ │ │ │ ├──frame_2.png
│ │ │ │ │ ├──...
│ │ │ │ ├──face_masks
│ │ │ │ │ ├──frame_0.png
│ │ │ │ │ ├──frame_1.png
│ │ │ │ │ ├──frame_2.png
│ │ │ │ │ ├──...
│ │ │ │ ├──lip_masks
│ │ │ │ │ ├──frame_0.png
│ │ │ │ │ ├──frame_1.png
│ │ │ │ │ ├──frame_2.png
│ │ │ │ │ ├──...
│ │ │ ├──00002
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──singing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──dancing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
├── vec
│ │ ├──speech
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──singing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──dancing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
├── square
│ │ ├──speech
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──singing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
│ │ ├──dancing
│ │ │ ├──00001
│ │ │ │ ├──sub_clip.mp4
│ │ │ │ ├──audio.wav
│ │ │ │ ├──images
│ │ │ │ ├──face_masks
│ │ │ │ ├──lip_masks
│ │ │ └──...
├── video_rec_path.txt
├── video_square_path.txt
└── video_vec_path.txt
```
StableAvatar is trained on mixed-resolution videos, with 512x512 videos stored in `talking_face_data/square`, 480x832 videos stored in `talking_face_data/vec`, and 832x480 videos stored in `talking_face_data/rec`. Each of `talking_face_data/square`, `talking_face_data/rec`, and `talking_face_data/vec` contains three subfolders holding different types of videos (speech, singing, and dancing).
All `.png` image files are named in the format `frame_i.png`, such as `frame_0.png`, `frame_1.png`, and so on.
`00001`, `00002`, `00003`, and so on denote individual video clips.
Within each clip folder, the three subfolders `images`, `face_masks`, and `lip_masks` store the RGB frames, the corresponding human face masks, and the corresponding human lip masks, respectively.
`sub_clip.mp4` and `audio.wav` are the RGB video corresponding to `images` and the matching audio file.
`video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt` list the clip folder paths under `talking_face_data/square`, `talking_face_data/rec`, and `talking_face_data/vec`, respectively.
For example, the content of `video_rec_path.txt` looks as follows:
```
path/StableAvatar/talking_face_data/rec/speech/00001
path/StableAvatar/talking_face_data/rec/speech/00002
...
path/StableAvatar/talking_face_data/rec/singing/00003
path/StableAvatar/talking_face_data/rec/singing/00004
...
path/StableAvatar/talking_face_data/rec/dancing/00005
path/StableAvatar/talking_face_data/rec/dancing/00006
...
```
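If helpful, these path lists can be generated automatically from the directory layout above. The script below is a small convenience sketch; `write_path_list` is a hypothetical helper, not part of the repository:

```python
import os

def write_path_list(data_root, out_txt):
    """Write one clip-folder path per line, e.g. to produce video_rec_path.txt."""
    with open(out_txt, "w") as f:
        for category in ("speech", "singing", "dancing"):
            category_dir = os.path.join(data_root, category)
            if not os.path.isdir(category_dir):
                continue
            for clip in sorted(os.listdir(category_dir)):
                f.write(os.path.join(category_dir, clip) + "\n")

# Example usage:
# write_path_list("path/StableAvatar/talking_face_data/rec",
#                 "path/StableAvatar/talking_face_data/video_rec_path.txt")
```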
If you only have raw videos, you can use `ffmpeg` to extract frames from a raw video (for example a speech clip) and store them in the `images` subfolder:
```
ffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path/StableAvatar/talking_face_data/rec/speech/00001/images/frame_%d.png
```
The extracted frames are saved in `path/StableAvatar/talking_face_data/rec/speech/00001/images`.

For extracting the human face masks, please refer to the [StableAnimator repo](https://github.com/Francis-Rings/StableAnimator). The Human Face Mask Extraction section of that tutorial provides off-the-shelf code.

For extracting the human lip masks, you can run the following command:
```
pip install mediapipe
python lip_mask_extractor.py --folder_root="path/StableAvatar/talking_face_data/rec/singing" --start=1 --end=500
```
`--folder_root` refers to the root path of the training dataset.
`--start` and `--end` specify the starting and ending indices of the selected training data. For example, `--start=1 --end=500` means that lip mask extraction starts at `path/StableAvatar/talking_face_data/rec/singing/00001` and ends at `path/StableAvatar/talking_face_data/rec/singing/00500`.

For details on extracting the corresponding audio, please refer to the Audio Extraction section.
When your dataset is organized exactly as outlined above, you can train your Wan2.1-1.3B-based StableAvatar by running the following commands:
```
# Training StableAvatar at a single resolution setting (512x512) on a single machine
bash train_1B_square.sh
# Training StableAvatar at a single resolution setting (512x512) on multiple machines
bash train_1B_square_64.sh
# Training StableAvatar at a mixed resolution setting (480x832 and 832x480) on a single machine
bash train_1B_rec_vec.sh
# Training StableAvatar at a mixed resolution setting (480x832 and 832x480) on multiple machines
bash train_1B_rec_vec_64.sh
```
Regarding the parameters of `train_1B_square.sh` and `train_1B_rec_vec.sh`, `CUDA_VISIBLE_DEVICES` refers to the GPU devices. In our setting, we use four NVIDIA A100 80GB GPUs to train StableAvatar (`CUDA_VISIBLE_DEVICES=3,2,1,0`).
`--pretrained_model_name_or_path`, `--pretrained_wav2vec_path`, and `--output_dir` refer to the pretrained Wan2.1-1.3B path, the pretrained Wav2Vec2.0 path, and the path where checkpoints of the trained StableAvatar are saved.
`--train_data_square_dir`, `--train_data_rec_dir`, and `--train_data_vec_dir` are the paths of `video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt`, respectively.
`--validation_reference_path` and `--validation_driven_audio_path` are the paths of the validation reference image and the validation driving audio.
`--video_sample_n_frames` is the number of frames that StableAvatar processes in a single batch.
`--num_train_epochs` is the number of training epochs. Note that the default number of training epochs is set to infinite; you can manually terminate training once you observe that your StableAvatar has reached its peak performance.
For `train_1B_square_64.sh` and `train_1B_rec_vec_64.sh`, the GPU configuration is set in `path/StableAvatar/accelerate_config/accelerate_config_machine_1B_multiple.yaml`. In our setting, training uses 8 nodes, each equipped with 8 NVIDIA A100 80GB GPUs.

The overall file structure of StableAvatar during training is as follows:
```
StableAvatar/
├── accelerate_config
├── deepspeed_config
├── talking_face_data
├── examples
├── wan
├── checkpoints
│ ├── Kim_Vocal_2.onnx
│ ├── wav2vec2-base-960h
│ ├── Wan2.1-Fun-V1.1-1.3B-InP
│ └── StableAvatar-1.3B
├── inference.py
├── inference.sh
├── train_1B_square.py
├── train_1B_square.sh
├── train_1B_vec_rec.py
├── train_1B_vec_rec.sh
├── audio_extractor.py
├── vocal_seperator.py
├── requirements.txt
```
<b>It is worth noting that training StableAvatar requires approximately 50GB of VRAM due to the mixed-resolution (480x832 and 832x480) training pipeline.
However, if you train StableAvatar exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB.</b>
Additionally, the backgrounds of the selected training videos should remain static, as this helps the diffusion model compute an accurate reconstruction loss.
The audio should be clear and free from excessive background noise.

To train Wan2.1-14B-based StableAvatar, you can run the following commands:
```
# Training StableAvatar at a mixed resolution setting (480x832, 832x480, and 512x512) on multiple machines
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./checkpoints/Wan2.1-I2V-14B-480P
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./checkpoints/Wan2.1-I2V-14B-720P # Optional
bash train_14B.sh
```
We use DeepSpeed stage-2 to train Wan2.1-14B-based StableAvatar. The GPU configuration can be modified in `path/StableAvatar/accelerate_config/accelerate_config_machine_14B_multiple.yaml`.
The DeepSpeed optimizer and scheduler configurations are in `path/StableAvatar/deepspeed_config/zero_stage2_config.json`.
Notably, we observe that Wan2.1-1.3B-based StableAvatar is already capable of synthesizing infinite-length, high-quality avatar videos. The Wan2.1-14B backbone significantly increases inference latency and GPU memory consumption during training, yielding a limited performance-to-resource ratio.

If you want to train a 720P Wan2.1-1.3B-based or Wan2.1-14B-based StableAvatar, you can directly modify the height and width of the dataloader (480p --> 720p) in `train_1B_square.py`/`train_1B_vec_rec.py`/`train_14B.py`.

### 🧱 VRAM Requirements and Runtime

For a 5-second video (480x832, 25 fps), the basic model (`--GPU_memory_mode="model_full_load"`) requires approximately 18GB of VRAM and finishes in about 3 minutes on a 4090 GPU.

<b>🔥🔥Theoretically, StableAvatar is capable of synthesizing hours of video without significant quality degradation; however, the 3D VAE decoder demands significant GPU memory, especially when decoding 10k+ frames. You have the option to run the VAE decoder on the CPU.🔥🔥</b>

## Contact
If you have any suggestions or find our work helpful, feel free to contact me:

Email: [email protected]

If you find our work useful, <b>please consider giving a star ⭐ to this GitHub repository and citing it ❤️</b>:
```bib
@article{tu2025stableavatar,
  title={StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation},
  author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2508.08248},
  year={2025}
}
```