Text-to-Video · Diffusers · Safetensors · WanDMDPipeline
BrianChen1129 committed · Commit 4b42dc1 · verified · 1 Parent(s): 395c74e

Update README.md

Files changed (1)
  1. README.md +24 -10
README.md CHANGED
@@ -1,8 +1,9 @@
  ---
  license: apache-2.0
- ---
- ---
- license: apache-2.0
+ datasets:
+ - FastVideo/Wan-Syn_77x448x832_600k
+ base_model:
+ - Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  ---

  # FastVideo FastWan2.1-T2V-1.3B-Diffusers Model
@@ -22,18 +23,31 @@ license: apache-2.0



+ ## Introduction
+
+ This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at **61×448×832** resolution. We adopt the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), consisting of 600k synthetic latents.
+
+ ---
+
  ## Model Overview
- - This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-1.3B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers).
- - It was trained on 8 nodes with 64 H200 GPUs in total, using a batch size of 64. The example slurm script can be found [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm)
- - It supports 3-step inference and achieves up to **20 FPS** on a single **H100** GPU.
- - Supports generating videos with **61×448×832** resolution.
- - Both [finetuning](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh) and [inference](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh) scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository.
+
+ - 3-step inference is supported and achieves up to **20 FPS** on a single **H100** GPU.
+ - Supports generating videos with resolution **61×448×832**.
+ - Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
+   - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
+   - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
  - Try it out on **FastVideo** — we support a wide range of GPUs from **H100** to **4090**, and even support **Mac** users!
- - We use [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k) for training.
+
+ ### Training Infrastructure
+
+ Training was conducted on **4 nodes with 32 H200 GPUs** in total, using a `global batch size = 64`.
+ We enable `gradient checkpointing`, set `gradient_accumulation_steps=2`, and use `learning rate = 1e-5`.
+ We set **VSA attention sparsity** to 0.8, and training runs for **4000 steps (~12 hours)**.
+ The example training script is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v.slurm).



- If you use FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
+ If you use the FastWan2.1-T2V-1.3B-Diffusers model for your research, please cite our paper:
  ```
  @article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
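
As a complement to the inference script linked in the updated card, below is a minimal sketch of what 3-step generation at **61×448×832** could look like through the generic Diffusers API. The repository id, dtype, and call arguments are assumptions rather than the card's official usage; the [inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh) in the FastVideo repo is the supported path.

```python
# Minimal sketch only: assumes the checkpoint loads through the generic
# Diffusers auto-pipeline API; the FastVideo inference script is the
# supported way to run this model.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Repo id is assumed from the model name in this card.
pipe = DiffusionPipeline.from_pretrained(
    "FastVideo/FastWan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# 3-step distilled inference at the supported 61x448x832 resolution.
video = pipe(
    prompt="A corgi running on the beach at sunset",
    num_frames=61,
    height=448,
    width=832,
    num_inference_steps=3,
).frames[0]

export_to_video(video, "fastwan_t2v_sample.mp4", fps=16)
```

DMD-style distillation typically folds guidance into the student model, so the sketch omits `guidance_scale`; defer to the official script for the exact arguments.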