---
license: apache-2.0
datasets:
- FastVideo/Wan-Syn_77x448x832_600k
base_model:
- Wan-AI/Wan2.1-T2V-14B-Diffusers
---
# FastVideo FastWan2.1-T2V-14B-480P-Diffusers
<p align="center">
<img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.png" width="200"/>
</p>
<div>
<div align="center">
<a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
</div>
<div align="center">
<a href="https://arxiv.org/pdf/2505.13389">Paper</a> |
<a href="https://github.com/hao-ai-lab/FastVideo">Github</a>
</div>
</div>
## Introduction
We're excited to introduce the **FastWan2.1 series**, a new line of models finetuned with our novel **Sparse-distill** strategy. This approach jointly applies DMD and VSA in a single training process, combining the benefits of **distillation** (fewer diffusion steps) and **sparse attention** (less attention computation) to enable even faster video generation.
FastWan2.1-T2V-14B-480P-Diffusers is built upon Wan-AI/Wan2.1-T2V-14B-Diffusers. It supports efficient **3-step inference** and produces high-quality videos at 61×448×832 resolution. For training, we use the FastVideo 480P synthetic Wan dataset, which contains 600k synthetic latents.
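Because the checkpoint ships in Diffusers format, it should also be loadable with the standard Diffusers Wan pipeline. The snippet below is a minimal sketch, assuming `WanPipeline` accepts this DMD-distilled checkpoint and that its scheduler handles 3-step sampling out of the box; the FastVideo CLI shown later is the tested path.
```python
# Minimal sketch: loading the Diffusers-format checkpoint directly.
# Assumption: diffusers' WanPipeline accepts this DMD-distilled checkpoint
# and 3-step sampling works as-is; the FastVideo CLI below is the tested path.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "FastVideo/FastWan2.1-T2V-14B-480P-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

video = pipe(
    prompt="A curious raccoon explores a rain-soaked neon alley at night.",
    height=448,              # native training resolution: 61x448x832
    width=832,
    num_frames=61,
    num_inference_steps=3,   # DMD-distilled model: 3 denoising steps
    guidance_scale=0.0,      # assumption: distilled models run without CFG
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```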
## Model Overview
- 3-step inference is supported, achieving up to a **50x speedup** at 480P and a **70x speedup** at 720P in the denoising loop on a single **H100** GPU.
- Our model is trained at **61×448×832** resolution, but it can generate videos at **any resolution** (e.g., **480P or 720P**; quality may degrade away from the training resolution).
- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
- [1 Node/GPU debugging finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
- [Slurm training example script](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_14B.slurm)
- Inference script in FastVideo:
```bash
#!/bin/bash
# Install FastVideo and the VSA kernels first
git clone https://github.com/hao-ai-lab/FastVideo
cd FastVideo
pip install -e .
cd csrc/attn
git submodule update --init --recursive
python setup_vsa.py install
num_gpus=1
export FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
export MODEL_BASE=FastVideo/FastWan2.1-T2V-14B-480P-Diffusers
# export MODEL_BASE=hunyuanvideo-community/HunyuanVideo

# 720P 14B inference
# Torch compile is enabled, so expect the first video to generate slowly.
# Speed on H200 after warmup: 3/3 [00:13<00:00, 4.45s/it]
# You can use either --prompt or --prompt-txt, but not both.
fastvideo generate \
--model-path $MODEL_BASE \
--sp-size $num_gpus \
--tp-size 1 \
--num-gpus $num_gpus \
--height 720 \
--width 1280 \
--num-frames 81 \
--num-inference-steps 3 \
--fps 16 \
--prompt-txt assets/prompt.txt \
--negative-prompt "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards" \
--seed 1024 \
--output-path outputs_video_dmd_14B_720P/ \
--VSA-sparsity 0.9 \
--dmd-denoising-steps "1000,757,522" \
--enable_torch_compile
```
- Try it out with **FastVideo**: we support a wide range of GPUs from **H100** to **4090**, and Mac users as well!
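For intuition on the two speedup knobs in the script above: `--dmd-denoising-steps "1000,757,522"` pins the three timesteps the distilled student is queried at, and `--VSA-sparsity 0.9` skips roughly 90% of the attention computation. The sketch below is rough arithmetic under a hypothetical 50-step teacher baseline, not a measured result.
```python
# Back-of-the-envelope speedup estimate for the denoising loop.
# Hypothetical assumptions: a 50-step teacher baseline and attention
# dominating per-step cost; real gains depend on kernels and hardware.
teacher_steps = 50        # typical full sampling budget (assumption)
student_steps = 3         # DMD-distilled: timesteps 1000, 757, 522
vsa_sparsity = 0.9        # fraction of attention computation skipped

step_speedup = teacher_steps / student_steps   # ~16.7x from distillation
attn_speedup = 1 / (1 - vsa_sparsity)          # ~10x on the attention part
print(f"distillation: {step_speedup:.1f}x, attention: {attn_speedup:.1f}x")
# The combined gain is not multiplicative end to end (non-attention ops
# remain dense), consistent with the reported ~50x at 480P and ~70x at 720P.
```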
### Training Infrastructure
Training was conducted on **8 nodes with 64 H200 GPUs** in total, using a `global batch size = 64`.
We enable `gradient checkpointing`, set `HSDP_shard_dim = 8`, `sequence_parallel_size = 4`, and use `learning rate = 1e-5`.
We set the **VSA attention sparsity** to 0.9, and training runs for **3000 steps (~52 hours)**.
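As a sanity check on this layout, the arithmetic below recovers the per-rank batch size from the numbers above (assuming no gradient accumulation, which is not stated here):
```python
# Sanity check on the training layout described above.
# Assumption: no gradient accumulation (not stated in the card).
total_gpus = 8 * 8                     # 8 nodes x 8 H200s = 64 GPUs
sequence_parallel_size = 4             # each sample sharded over 4 GPUs
data_parallel_ranks = total_gpus // sequence_parallel_size   # 16
global_batch_size = 64
per_rank_batch = global_batch_size // data_parallel_ranks    # 4 samples/step
print(f"{data_parallel_ranks} DP ranks x {per_rank_batch} samples "
      f"= global batch {global_batch_size}")
```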
If you use the FastWan2.1-T2V-14B-480P-Diffusers model in your research, please cite our paper:
```bibtex
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}

@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```