Update README.md

README.md
---
license: apache-2.0
datasets:
- FastVideo/Wan-Syn_77x448x832_600k
base_model:
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
---

# FastVideo FastWan2.1-T2V-1.3B-Diffusers Model

## Introduction

This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at **61×448×832** resolution. We adopt the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), consisting of 600k synthetic latents.
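
Since the checkpoint follows the Diffusers layout of the base model, inference can be sketched with the standard Wan2.1 Diffusers pipeline. This is a minimal sketch, not the reference recipe: the repo id, guidance scale, and 3-step sampler settings below are assumptions, and the inference script linked under Model Overview is the reference setup.

```python
# Minimal text-to-video sketch, assuming the checkpoint loads like the base
# Wan2.1-T2V-1.3B-Diffusers model via diffusers' WanPipeline.
# guidance_scale and the 3-step schedule are assumptions for the DMD-distilled model;
# see the FastVideo inference script for the reference settings.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "FastVideo/FastWan2.1-T2V-1.3B-Diffusers"  # assumed Hugging Face repo id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A curious raccoon exploring a sunlit forest",
    height=448,
    width=832,
    num_frames=61,
    num_inference_steps=3,   # 3-step inference enabled by DMD distillation
    guidance_scale=1.0,      # distilled models typically run without CFG (assumption)
).frames[0]
export_to_video(frames, "fastwan_t2v.mp4", fps=16)  # playback fps is an assumption
```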

---
## Model Overview

- 3-step inference is supported and achieves up to **20 FPS** on a single **H100** GPU.
- Supports generating videos with resolution **61×448×832**.
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
  - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
- Try it out on **FastVideo** — we support a wide range of GPUs from **H100** to **4090**, and even support **Mac** users!
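
For trying the model directly through FastVideo rather than the shell scripts above, something along these lines should work. The `VideoGenerator` interface and argument names are taken from FastVideo's own examples as an assumption; consult that repository for the exact current API.

```python
# Sketch of generating a clip with FastVideo's Python API.
# The VideoGenerator interface and the repo id below are assumptions;
# see https://github.com/hao-ai-lab/FastVideo for the exact usage.
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "FastVideo/FastWan2.1-T2V-1.3B-Diffusers",  # assumed Hugging Face repo id
    num_gpus=1,
)
generator.generate_video("A curious raccoon exploring a sunlit forest")
```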

### Training Infrastructure

Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a `global batch size = 64`.
We enable `gradient checkpointing`, set `gradient_accumulation_steps=2`, and use `learning rate = 1e-5`.
We set **VSA attention sparsity** to 0.8, and training runs for **4000 steps (~12 hours)**.
A training example script is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm).
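
As a quick sanity check on these numbers (assuming plain data parallelism over all 32 GPUs), the stated global batch size works out to one latent sample per GPU per accumulation micro-step:

```python
# Back-of-the-envelope check of the training configuration described above.
# Assumes data parallelism across all 32 GPUs (4 nodes x 8 H200s each).
num_gpus = 32
grad_accum_steps = 2
global_batch_size = 64

per_gpu_micro_batch = global_batch_size // (num_gpus * grad_accum_steps)
print(per_gpu_micro_batch)  # -> 1 sample per GPU per micro-step
```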
If you use the FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
```
@article{zhang2025vsa,
title={VSA: Faster Video Diffusion with Trainable Sparse Attention},