Official PyTorch Implementation of SMILE (CVPR 2025).
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem
King Abdullah University of Science and Technology (KAUST)
📰 News
[2025.6.2] Code and pre-trained models are available now!
[2025.5.28] Code and pre-trained models will be released here. Watch this repository for the latest updates.
✨ Highlights
🔥 State-of-the-art on SSv2 and K400
Our method achieves state-of-the-art performance on SSv2 and K400 benchmarks with a ViT-B backbone, surpassing prior self-supervised video models by up to 2.5%, thanks to efficient CLIP-based semantic supervision.
⚡️ Leading Results Across Generalization Challenges
We evaluate our method on the SEVERE benchmark, covering domain shift, low-shot learning, fine-grained actions, and task adaptability. Our model consistently outperforms prior methods and achieves a 3.0% average gain over strong baselines, demonstrating superior generalization in diverse video understanding tasks.
😮 Superior Motion Representation Without Video-Text Alignment
Compared to CLIP-based methods such as ViCLIP and UMT, our model achieves higher accuracy on motion-sensitive datasets, particularly under linear probing. This indicates stronger video representations learned with less data and without relying on video-text alignment.
🚀 Main Results and Models
✨ Something-Something V2
Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 | Finetune |
---|---|---|---|---|---|
SMILE | K400 | 800 | ViT-S | 69.1 | TODO |
SMILE | K400 | 600 | ViT-B | 72.1 | log / checkpoint |
SMILE | K400 | 1200 | ViT-B | 72.4 | log / checkpoint |
SMILE | SSv2 | 800 | ViT-B | 72.5 | TODO |
✨ Kinetics-400
Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 | Pretrain | Finetune |
---|---|---|---|---|---|---|
SMILE | K400 | 800 | ViT-S | 79.5 | TODO | TODO |
SMILE | K400 | 600 | ViT-B | 83.1 | checkpoint | log / checkpoint |
SMILE | K400 | 1200 | ViT-B | 83.4 | checkpoint | log / checkpoint |
🔨 Installation
Please follow the instructions in INSTALL.md.
➡️ Data Preparation
We follow the VideoMAE data preparation pipeline to prepare our datasets (K400 and SSv2). We provide our annotation files for both datasets here: annotation_files. For pre-training, we use the training sets (train.csv).
We provide the list of segmented object images used for pretraining in object_instances.txt. The images will be released later.
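For reference, below is a minimal sketch of the annotation format, assuming the VideoMAE-style convention of one space-separated `<video_path> <label>` pair per line. The paths and label indices are purely illustrative placeholders; please check the provided annotation_files for the exact layout.

```bash
# Illustration only: hypothetical lines of a VideoMAE-style train.csv,
# one "<video_path> <label>" pair per line, separated by a single space.
# The paths and label indices below are placeholders, not real data.
cat <<'EOF' > train.csv
/data/k400/train/abseiling/video_0001.mp4 0
/data/k400/train/air_drumming/video_0002.mp4 1
EOF
```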
🔄 Pre-training
Following the VideoMAE pre-training guide, we provide scripts for pre-training on the Kinetics-400 (K400) dataset using the ViT-Base model: scripts/pretrain/
As described in the paper, we adopt a two-stage training strategy. Please refer to the script names to identify which stage to run.
If you wish to perform your own pre-training, make sure to update the following parameters in the scripts (a hedged example follows this list):
- `DATA_PATH`: Path to your dataset
- `OUTPUT_DIR`: Directory to save output results
- `OBJECTS_PATH`: Path to the overlaid object image dataset (image data to be released)
- `FIRST_STAGE_CKPT`: Path to the checkpoint from first-stage pre-training (required for second-stage training)
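As a rough, hedged example, the variables might be filled in as shown below. All paths and file names are placeholders rather than values from this repository, and the launch command already present in each provided script should be left unchanged.

```bash
# Hedged sketch with placeholder paths: edit these variables near the top of the
# scripts under scripts/pretrain/ before launching; keep the rest of each script as-is.

# Stage 1 (pre-training from scratch):
DATA_PATH='/path/to/k400/train.csv'        # annotation file of the pre-training set
OUTPUT_DIR='/path/to/output/smile_stage1'  # logs and checkpoints are written here
OBJECTS_PATH='/path/to/object_images'      # segmented object images (to be released)

# Stage 2 (continuing from the stage-1 checkpoint):
FIRST_STAGE_CKPT='/path/to/output/smile_stage1/checkpoint.pth'  # placeholder checkpoint name
OUTPUT_DIR='/path/to/output/smile_stage2'
```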
Note: Our pre-training experiments were conducted on 8 V100 (32 GB) GPUs.
⤴️ Fine-tuning with Pre-trained Models
Following the VideoMAE finetuning guide, we provide scripts for fine-tuning on the Something-Something v2 (SSv2) and Kinetics-400 (K400) datasets using the ViT-Base model: scripts/finetune/
To perform your own fine-tuning, please update the following parameters in the script (a hedged example follows this list):
- `DATA_PATH`: Path to your dataset
- `MODEL_PATH`: Path to the pre-trained model
- `OUTPUT_DIR`: Directory to save output results
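As with pre-training, here is a hedged sketch with placeholder values only; the actual dataset and checkpoint paths depend on your setup.

```bash
# Hedged sketch with placeholder paths: edit these variables in the scripts under
# scripts/finetune/ before launching; the rest of each script stays unchanged.
DATA_PATH='/path/to/ssv2'                                 # dataset / annotation location
MODEL_PATH='/path/to/output/smile_stage2/checkpoint.pth'  # pre-trained SMILE checkpoint (placeholder name)
OUTPUT_DIR='/path/to/output/smile_ssv2_finetune'          # fine-tuning logs and checkpoints
```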
Note: Our fine-tuning experiments were conducted on 4 V100 (32 GB) GPUs.
☎️ Contact
Fida Mohammad Thoker: [email protected]
👍 Acknowledgements
We sincerely thank Michael Dorkenwald for providing the object image dataset that supports this work.
This project is built upon VideoMAE and tubelet-contrast. Thanks to the contributors of these great codebases.
🔒 License
This project is released under the MIT license. For more details, please refer to the LICENSE file.
✏️ Citation
If you find this project helpful, please consider leaving a star ⭐️ and citing our paper:
@inproceedings{thoker2025smile,
  author    = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
  title     = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
  booktitle = {CVPR},
  year      = {2025},
}