---
base_model:
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
datasets:
- FastVideo/Wan-Syn_77x448x832_600k
license: apache-2.0
pipeline_tag: text-to-video
library_name: diffusers
---

# FastVideo FastWan2.1-T2V-1.3B-Diffusers Model
<p align="center">
  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.png" width="200"/>
</p>
<div>
  <div align="center">
    <a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
  </div>

  <div align="center">
    <a href="https://arxiv.org/pdf/2505.13389">Paper</a> | 
    <a href="https://github.com/hao-ai-lab/FastVideo">Github</a> |
    <a href="https://hao-ai-lab.github.io/FastVideo">Project Page</a>
  </div>
</div>

## Online Demo
You can try our models [here](https://fastwan.fastvideo.org/)!

## Introduction
We're excited to introduce the **FastWan2.1 series**, a new line of models finetuned with our novel **Sparse-distill** strategy. This approach jointly applies Distribution Matching Distillation (DMD) and Video Sparse Attention (VSA) in a single training process, combining the benefits of **distillation**, which shortens the diffusion process to a few steps, and **sparse attention**, which reduces attention computation, enabling even faster video generation.

FastWan2.1-T2V-1.3B-Diffusers is built upon Wan-AI/Wan2.1-T2V-1.3B-Diffusers. It supports efficient **3-step inference** and produces high-quality videos at 61×448×832 resolution. For training, we use the FastVideo 480P Synthetic Wan dataset (`FastVideo/Wan-Syn_77x448x832_600k`), which contains 600k synthetic latents.
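
For a quick look at what 3-step inference means in code, here is a minimal sketch that loads the checkpoint through the `diffusers` `WanPipeline` API. Treat it as a sketch under the assumption that the distilled weights work with the stock pipeline and its default scheduler; the FastVideo CLI in the Model Overview below is the documented inference path, including the DMD denoising schedule.

```python
# Minimal sketch (assumption: the distilled checkpoint loads via the stock
# diffusers WanPipeline; the FastVideo CLI below is the documented path).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "FastVideo/FastWan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A curious raccoon explores a sunlit forest stream.",
    height=448,
    width=832,
    num_frames=61,          # training setting: 61x448x832
    num_inference_steps=3,  # DMD-distilled 3-step inference
    guidance_scale=0.0,     # assumption: CFG is not needed for a DMD-distilled model
).frames[0]

export_to_video(video, "fastwan_t2v.mp4", fps=16)
```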

---

## Model Overview

- 3-step inference is supported and achieves up to **16 FPS** on a single **H100** GPU.
- Our model is trained at **61×448×832** resolution, but it supports generating videos at **any resolution** (quality may degrade away from the training resolution).
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:  
  - [1 Node/GPU debugging finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
  - [Slurm training example script](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm)  
  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
```bash
# Install FastVideo and the VSA kernels first
git clone https://github.com/hao-ai-lab/FastVideo
cd FastVideo
pip install -e .
cd csrc/attn
git submodule update --init --recursive
python setup_vsa.py install

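# Run 3-step DMD inference with the Video Sparse Attention backend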
num_gpus=1
export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
export MODEL_BASE=FastVideo/FastWan2.1-T2V-1.3B-Diffusers
# export MODEL_BASE=hunyuanvideo-community/HunyuanVideo
# You can either use --prompt or --prompt-txt, but not both.
fastvideo generate \
    --model-path $MODEL_BASE \
    --sp-size $num_gpus \
    --tp-size 1 \
    --num-gpus $num_gpus \
    --height 480 \
    --width 848 \
    --num-frames 81 \
    --num-inference-steps 3 \
    --fps 16 \
    --prompt-txt assets/prompt.txt \
    --negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
    --seed 1024 \
    --output-path outputs_video_dmd/ \
    --VSA-sparsity 0.8 \
    --dmd-denoising-steps "1000,757,522"
```
- Try it out with **FastVideo**: we support a wide range of GPUs, from **H100** to **4090**, and Mac users as well! A minimal Python sketch follows below.
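
For programmatic use, the following is a minimal sketch of FastVideo's Python entry point rather than the CLI; the `VideoGenerator` API and its keyword arguments are assumptions based on the FastVideo README and may change across versions, so treat the CLI above as the reference.

```python
# Minimal sketch (assumption: FastVideo exposes VideoGenerator.from_pretrained and
# generate_video as in the project README; the CLI above is the reference).
import os

# Select the Video Sparse Attention backend before the generator is created.
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "VIDEO_SPARSE_ATTN"

from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "FastVideo/FastWan2.1-T2V-1.3B-Diffusers",
    num_gpus=1,
)

generator.generate_video(
    "A curious raccoon explores a sunlit forest stream.",
    output_path="outputs_video_dmd/",
)
```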

### Training Infrastructure

Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a global batch size of **64**.  
We enabled `gradient checkpointing`, set `gradient_accumulation_steps=2`, and used a learning rate of `1e-5`.  
We set the **VSA attention sparsity** to 0.8, and training ran for **4000 steps (~12 hours)**.
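
As a sanity check on these numbers, the global batch size follows from the GPU count, the per-GPU micro-batch (assumed to be 1, not stated above), and the accumulation steps:

```python
# Effective global batch size (assumption: micro-batch size of 1 per GPU)
num_gpus = 4 * 8                       # 4 nodes x 8 H200 GPUs = 32
micro_batch_per_gpu = 1                # assumed; not stated in the card
gradient_accumulation_steps = 2

global_batch = num_gpus * micro_batch_per_gpu * gradient_accumulation_steps
assert global_batch == 64              # matches the stated global batch size
```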

If you use the FastWan2.1-T2V-1.3B-Diffusers model in your research, please cite our papers:
```
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}
@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```