Text-to-Video · Diffusers · Safetensors · WanDMDPipeline
BrianChen1129 committed · Commit 4b42dc1 · verified · 1 Parent(s): 395c74e

Update README.md

Files changed (1)
  1. README.md +24 -10
README.md CHANGED
@@ -1,8 +1,9 @@
  ---
  license: apache-2.0
- ---
- ---
- license: apache-2.0
+ datasets:
+ - FastVideo/Wan-Syn_77x448x832_600k
+ base_model:
+ - Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  ---

  # FastVideo FastWan2.1-T2V-1.3B-Diffusers Model
@@ -22,18 +23,31 @@ license: apache-2.0



+ ## Introduction
+
+ This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at **61×448×832** resolution. We adopt the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), consisting of 600k synthetic latents.
+
+ ---
+
  ## Model Overview
- - This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers).
- - It was trained on 8 nodes with 64 H200 GPUs in total, using a batch size of 64. The example slurm script can be found [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm)
- - It supports 3-step inference and achieves up to **20 FPS** on a single **H100** GPU.
- - Supports generating videos with **61×448×832** resolution.
- - Both [finetuning](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh) and [inference](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh) scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository.
+
+ - 3-step inference is supported and achieves up to **20 FPS** on a single **H100** GPU.
+ - Supports generating videos with resolution **61×448×832**.
+ - Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
+   - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
+   - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
  - Try it out on **FastVideo** — we support a wide range of GPUs from **H100** to **4090**, and even support **Mac** users!
- - We use [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k) for training.
+
+ ### Training Infrastructure
+
+ Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a `global batch size = 64`.
+ We enable `gradient checkpointing`, set `gradient_accumulation_steps=2`, and use `learning rate = 1e-5`.
+ We set **VSA attention sparsity** to 0.8, and training runs for **4000 steps (~12 hours)**.
+ The example training script is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm).



- If you use FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
+ If you use the FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
  ```
  @article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
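
As a complement to the inference script linked in the updated card, below is a minimal sketch of what 3-step generation at **61×448×832** could look like through the generic Diffusers API. The repository id, dtype, and call arguments are assumptions rather than the card's official usage; the [inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh) in the FastVideo repo is the supported path.

```python
# Minimal sketch only: assumes the checkpoint loads through the generic
# Diffusers auto-pipeline API; the FastVideo inference script is the
# supported way to run this model.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Repo id is assumed from the model name in this card.
pipe = DiffusionPipeline.from_pretrained(
    "FastVideo/FastWan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# 3-step distilled inference at the supported 61x448x832 resolution.
video = pipe(
    prompt="A corgi running on the beach at sunset",
    num_frames=61,
    height=448,
    width=832,
    num_inference_steps=3,
).frames[0]

export_to_video(video, "fastwan_t2v_sample.mp4", fps=16)
```

DMD-style distillation typically folds guidance into the student model, so the sketch omits `guidance_scale`; defer to the official script for the exact arguments.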