MCM-Simplified

Model description

License: apache-2.0
Tags: text-to-video, motion-consistency, distillation

Motion Consistency Model - Simplified Implementation

This model is a distilled version of the Motion Consistency Model, trained on a subset of WebVid 2M with additional filtered image-caption pairs from the LAION aesthetic dataset.

Sample Generated Videos

Each sample compares the teacher (ModelScope, 50 DDIM steps) with the two students (Setup 1 and Setup 2, 4 steps each) on the following captions:
- Worker slicing a piece of meat.
- Pancakes with chocolate syrup, nuts, and bananas.

Training Details

Dataset: 3022 video-caption pairs from WebVid 2M
Image-caption pairs:
- Setup 1: 20K filtered LAION aesthetic images (min. resolution 450×450)
- Setup 2: 7.5K filtered LAION aesthetic images (min. resolution 1024×1024)
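
The resolution thresholds above can be sketched as a simple filter. This is only an illustration: the record layout `(caption, width, height)` and the function name are hypothetical, and reading "min. resolution" as "both sides at least N pixels" is an assumption, since the actual filtering code is not described here.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (caption, width, height).
# The real dataset schema is not documented in this card.
ImageRecord = Tuple[str, int, int]


def filter_by_min_resolution(
    records: Iterable[ImageRecord], min_side: int
) -> List[ImageRecord]:
    """Keep records whose width AND height are both at least min_side pixels."""
    return [r for r in records if r[1] >= min_side and r[2] >= min_side]


# Toy records standing in for LAION aesthetic entries.
records = [
    ("a stack of pancakes", 1280, 1024),
    ("a small thumbnail", 320, 240),
    ("a kitchen scene", 640, 480),
]

setup1_pool = filter_by_min_resolution(records, 450)   # Setup 1 threshold
setup2_pool = filter_by_min_resolution(records, 1024)  # Setup 2 threshold
```

The stricter Setup 2 threshold keeps fewer images, which is consistent with the smaller pool (7.5K vs. 20K) listed above.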

Training Configurations

Setup 1

LR: 5e-6, Grad Accum: 4, Max Grad Norm: 10
Discriminator LR: 5e-5, Weight: 1, Lambda R1: 1e-5
EMA Decay: 0.95, Epochs: 7, Steps: ~5100
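
To give a feel for what the EMA decay controls, here is a minimal sketch of the standard exponential moving average update on a plain list of floats. Which weights the training code actually averages is not stated in this card, so the function names and the toy parameter are assumptions for illustration only.

```python
from typing import List


def ema_update(ema: List[float], current: List[float], decay: float) -> List[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]


def averaging_horizon(decay: float) -> float:
    """Rough number of recent steps the EMA averages over: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


# Toy example: a single "weight" that jumps from 0 to 1 and stays there.
ema = [0.0]
for _ in range(20):
    ema = ema_update(ema, [1.0], decay=0.95)

# After n steps the EMA has covered a fraction 1 - decay**n of the jump.
fraction_covered = ema[0]
```

With a decay of 0.95 the average spans roughly the last 20 updates; the 0.98 decay listed for Setup 2 spans roughly 50, so the averaged weights react more slowly to each step.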

Setup 2 (Modified)

LR: 2e-6, Grad Accum: 16, Max Grad Norm: 5
Discriminator LR: 1e-6, Weight: 0.5, Lambda R1: 1e-4
EMA Decay: 0.98, LR Warmup: 300 steps, Epochs: 10
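
The two configurations can be written side by side for easier comparison. The sketch below only restates the values listed above; the class and field names are hypothetical and do not come from the training code.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in this card (field names are illustrative)."""
    lr: float
    grad_accum: int
    max_grad_norm: float
    disc_lr: float
    disc_weight: float
    lambda_r1: float
    ema_decay: float
    epochs: int
    approx_steps: Optional[int] = None     # only listed for Setup 1
    lr_warmup_steps: Optional[int] = None  # only listed for Setup 2


SETUP_1 = TrainConfig(
    lr=5e-6, grad_accum=4, max_grad_norm=10.0,
    disc_lr=5e-5, disc_weight=1.0, lambda_r1=1e-5,
    ema_decay=0.95, epochs=7, approx_steps=5100,
)

SETUP_2 = TrainConfig(
    lr=2e-6, grad_accum=16, max_grad_norm=5.0,
    disc_lr=1e-6, disc_weight=0.5, lambda_r1=1e-4,
    ema_decay=0.98, epochs=10, lr_warmup_steps=300,
)


def changed_fields(a: TrainConfig, b: TrainConfig) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}


for name, (old, new) in changed_fields(SETUP_1, SETUP_2).items():
    print(f"{name}: {old} -> {new}")
```

One thing the comparison makes visible: the discriminator learning rate is 10 times the generator's in Setup 1 but only half of it in Setup 2, alongside a lower discriminator weight and a stronger R1 penalty.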

Evaluation

Fréchet Video Distance (FVD), lower is better

Model	1 Step	2 Steps	4 Steps	8 Steps
Teacher (50 DDIM Steps)	2954.77	-	-	-
Student - Setup 1	2598.15	2684.24	3082.84	3914.78
Student - Setup 2	2589.01	3053.35	3284.69	3930.07
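
For reference, FVD is usually defined as the Fréchet distance between two Gaussians fitted to video features of real and generated clips (commonly features from a pretrained I3D network). The feature extractor and reference set used for the numbers above are not stated in this card, so the formula below is the general definition rather than a description of this exact evaluation. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the real and generated videos.

```latex
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The distance is zero when the two fitted Gaussians coincide and grows as the feature statistics drift apart.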

CLIP Similarity (×100), higher is better

Model	1 Step	2 Steps	4 Steps	8 Steps
Teacher (50 DDIM Steps)	27.88	-	-	-
Student - Setup 1	22.55	25.62	26.86	27.01
Student - Setup 2	20.13	23.41	25.31	24.62
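
As a rough sketch of how a score on this scale can be computed: CLIP similarity is the cosine similarity between a text embedding and an image embedding, multiplied by 100. Averaging over the frames of a video, as done below, is an assumption, since the card does not state how frames were aggregated. The toy vectors stand in for real CLIP embeddings, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_x100(
    text_emb: Sequence[float], frame_embs: Sequence[Sequence[float]]
) -> float:
    """Mean text-to-frame cosine similarity over all frames, times 100."""
    sims = [cosine_similarity(text_emb, f) for f in frame_embs]
    return 100.0 * sum(sims) / len(sims)


# Toy 2-D "embeddings": one frame aligned with the text, one orthogonal to it.
text_emb = [1.0, 0.0]
frame_embs = [[1.0, 0.0], [0.0, 1.0]]
score = clip_score_x100(text_emb, frame_embs)
```

In a real evaluation the embeddings would come from a CLIP text encoder and image encoder applied to the caption and to the decoded video frames.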

Conclusion

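Setup 2 was modified to stabilize training and prevent the discriminator from overpowering the generator. The changes slightly improved FVD for 1-step inference (2589.01 vs. 2598.15), but Setup 2 scored worse than Setup 1 on FVD at 2, 4, and 8 steps. CLIP similarity increased from 1 to 4 inference steps in both setups, indicating better text-to-video alignment with more steps (Setup 2 dipped slightly at 8 steps). Setup 2 stayed below Setup 1 on CLIP similarity at every step count, and neither student reached the teacher's 27.88.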
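
The comparison between the two setups can be checked directly against the tables in the Evaluation section. The sketch below copies the reported numbers and, for each step count, picks the setup with the better score; the helper name is hypothetical.

```python
from typing import Dict, Tuple

STEPS = (1, 2, 4, 8)

# Values copied from the Evaluation tables above.
FVD: Dict[str, Tuple[float, ...]] = {
    "Setup 1": (2598.15, 2684.24, 3082.84, 3914.78),
    "Setup 2": (2589.01, 3053.35, 3284.69, 3930.07),
}
CLIP: Dict[str, Tuple[float, ...]] = {
    "Setup 1": (22.55, 25.62, 26.86, 27.01),
    "Setup 2": (20.13, 23.41, 25.31, 24.62),
}


def better_setup(
    metric: Dict[str, Tuple[float, ...]], lower_is_better: bool
) -> Dict[int, str]:
    """Map each step count to the name of the setup with the better score."""
    pick = min if lower_is_better else max
    return {
        step: pick(metric, key=lambda name: metric[name][i])
        for i, step in enumerate(STEPS)
    }


fvd_winner = better_setup(FVD, lower_is_better=True)
clip_winner = better_setup(CLIP, lower_is_better=False)
print("FVD :", fvd_winner)
print("CLIP:", clip_winner)
```

Printing both dictionaries gives a per-step view of where the two setups differ.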

SepehrNoey/MCM-Simplified