MCM-Simplified

Model description
- License: apache-2.0
- Tags: text-to-video, motion-consistency, distillation
Motion Consistency Model - Simplified Implementation
This model is a distilled version of the Motion Consistency Model, trained on a subset of WebVid 2M with additional filtered image-caption pairs from the LAION aesthetic dataset.
Sample Generated Videos
Caption | Teacher (ModelScope) - 50 DDIM Steps | Student (Setup 1) - 4 Steps | Student (Setup 2) - 4 Steps |
---|---|---|---|
Worker slicing a piece of meat. | (video) | (video) | (video) |
Pancakes with chocolate syrup, nuts, and bananas. | (video) | (video) | (video) |
Training Details
- Dataset: 3022 video-caption pairs from WebVid 2M
- Image Pairs:
- Setup 1: 20K filtered LAION aesthetic images (min. resolution 450×450)
- Setup 2: 7.5K filtered LAION aesthetic images (min. resolution 1024×1024)
Training Configurations
Setup 1
- LR: 5e-6, Grad Accum: 4, Max Grad Norm: 10
- Discriminator LR: 5e-5, Weight: 1, Lambda R1: 1e-5
- EMA Decay: 0.95, Epochs: 7, Steps: ~5100
Setup 2 (Modified)
- LR: 2e-6, Grad Accum: 16, Max Grad Norm: 5
- Discriminator LR: 1e-6, Weight: 0.5, Lambda R1: 1e-4
- EMA Decay: 0.98, LR Warmup: 300 steps, Epochs: 10
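For side-by-side comparison, the two configurations listed above can be written down as plain data. This is only a sketch: the class and field names (`TrainConfig`, `disc_weight`, `changed_fields`, and so on) are hypothetical and are not taken from the training code; only the values come from this card. Warmup is left as `None` for Setup 1 because the card does not report it.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in this card (field names are illustrative)."""
    lr: float
    grad_accum: int
    max_grad_norm: float
    disc_lr: float
    disc_weight: float
    lambda_r1: float
    ema_decay: float
    epochs: int
    lr_warmup_steps: Optional[int] = None  # not reported for Setup 1


SETUP_1 = TrainConfig(lr=5e-6, grad_accum=4, max_grad_norm=10.0,
                      disc_lr=5e-5, disc_weight=1.0, lambda_r1=1e-5,
                      ema_decay=0.95, epochs=7)

SETUP_2 = TrainConfig(lr=2e-6, grad_accum=16, max_grad_norm=5.0,
                      disc_lr=1e-6, disc_weight=0.5, lambda_r1=1e-4,
                      ema_decay=0.98, epochs=10, lr_warmup_steps=300)


def changed_fields(a: TrainConfig, b: TrainConfig) -> dict:
    """Return {field_name: (value_in_a, value_in_b)} for fields that differ."""
    return {f.name: (getattr(a, f.name), getattr(b, f.name))
            for f in fields(a)
            if getattr(a, f.name) != getattr(b, f.name)}


if __name__ == "__main__":
    for name, (old, new) in changed_fields(SETUP_1, SETUP_2).items():
        print(f"{name}: {old} -> {new}")
```

Running the script lists every hyperparameter that differs between the two setups.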
Evaluation
Fréchet Video Distance (FVD, lower is better)
Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
---|---|---|---|---|
Teacher (50 DDIM Steps) | 2954.77 | - | - | - |
Student - Setup 1 | 2598.15 | 2684.24 | 3082.84 | 3914.78 |
Student - Setup 2 | 2589.01 | 3053.35 | 3284.69 | 3930.07 |
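For reference, FVD is usually defined as the Fréchet distance between two Gaussians fitted to features of real and generated videos. The card does not state which feature extractor or evaluation set was used, so the symbols below are generic: $\mu_r, \Sigma_r$ are the mean and covariance of the real-video features, and $\mu_g, \Sigma_g$ those of the generated-video features.

```latex
\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The distance is zero when the two fitted Gaussians coincide, which is why lower values indicate generated videos that are statistically closer to the reference set.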
CLIP Similarity (×100, higher is better)
Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
---|---|---|---|---|
Teacher (50 DDIM Steps) | 27.88 | - | - | - |
Student - Setup 1 | 22.55 | 25.62 | 26.86 | 27.01 |
Student - Setup 2 | 20.13 | 23.41 | 25.31 | 24.62 |
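A minimal sketch of how a ×100 CLIP similarity score can be computed once text and frame embeddings are available. The card does not describe the exact procedure, so averaging the cosine similarity over frames is an assumption, and the function names (`cosine_similarity`, `clip_score`) are hypothetical. The embeddings would normally come from a CLIP text and image encoder; toy vectors stand in for them here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(text_emb: Sequence[float],
               frame_embs: Sequence[Sequence[float]]) -> float:
    """Mean text-frame cosine similarity, scaled by 100.

    Assumption: one embedding per video frame, averaged uniformly.
    """
    sims = [cosine_similarity(text_emb, frame) for frame in frame_embs]
    return 100.0 * sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame aligned with the text, one orthogonal to it.
    text = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(clip_score(text, frames))
```

With real CLIP embeddings the same function applies unchanged; only the source of `text_emb` and `frame_embs` differs.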
Conclusion
Setup 2 was modified to stabilize training and prevent the discriminator from overpowering the generator. Relative to Setup 1, the changes slightly improved 1-step FVD (2589.01 vs. 2598.15), but FVD was worse at 2, 4, and 8 steps. CLIP similarity generally rose with more inference steps in both setups, indicating better text-to-video alignment at higher step counts, but Setup 2 scored below Setup 1 at every step count.
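The per-step comparison between the two students can be checked directly from the evaluation tables. The sketch below copies the reported numbers and prints the Setup 2 minus Setup 1 difference at each step count; for FVD a negative difference favors Setup 2, and for CLIP similarity a positive difference favors Setup 2. The variable and function names are illustrative only.

```python
STEPS = (1, 2, 4, 8)

# Values copied from the evaluation tables in this card.
FVD = {
    "Setup 1": (2598.15, 2684.24, 3082.84, 3914.78),
    "Setup 2": (2589.01, 3053.35, 3284.69, 3930.07),
}
CLIP = {
    "Setup 1": (22.55, 25.62, 26.86, 27.01),
    "Setup 2": (20.13, 23.41, 25.31, 24.62),
}


def deltas(table: dict) -> dict:
    """Return {steps: Setup 2 value minus Setup 1 value}, rounded to 2 decimals."""
    return {s: round(b - a, 2)
            for s, a, b in zip(STEPS, table["Setup 1"], table["Setup 2"])}


fvd_delta = deltas(FVD)
clip_delta = deltas(CLIP)

if __name__ == "__main__":
    print("steps  dFVD     dCLIP")
    for s in STEPS:
        print(f"{s:<6d} {fvd_delta[s]:<8.2f} {clip_delta[s]:.2f}")
```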
References
Original Implementation: Motion Consistency Model
Download model
Weights for this model are available in Safetensors format.
Download them in the Files & versions tab.
Model tree for SepehrNoey/MCM-Simplified
- Base model: ali-vilab/text-to-video-ms-1.7b