---
library_name: transformers
tags: []
---

# Model Card for SujitShelar/vjepa2-vitl-fpc16-256-hmdb51

## Model Details

### Model Description

V-JEPA 2 is a self-supervised video backbone trained on over 1 million hours of internet video; Meta released checkpoints with a Something-Something V2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6,766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive Top-1 accuracy (see *Evaluation*).

- **Developed by:** Sujit Shelar
- **Funded by:** self-funded (personal compute credits)
- **Shared by:** Sujit Shelar
- **Model type:** V-JEPA 2 ViT-Large (16-frame, 256² input) video encoder with a 51-way classification head; vision-only, no text inputs
- **Language(s) (NLP):** not applicable (video-only model)
- **License:** MIT, identical to the upstream V-JEPA 2 weights
- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2

### Model Sources

- **Repository:** https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

### Direct Use

- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extraction for video retrieval or robotics perception pipelines.

### Downstream Use

- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101).

### Out-of-Scope Use

- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks: the model outputs only class logits.

## Bias, Risks, and Limitations

HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (6,766 clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. For higher accuracy, consider unfreezing a few backbone blocks, adding LoRA or prompt adapters, or running a full fine-tune (see the take-away under *Results*).

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id)

# Sample one ~5-second clip with torchvision.io or torchcodec, shape (T, C, H, W).
video = torch.randn(16, 3, 256, 256)  # dummy tensor standing in for real frames

inputs = processor(video.unsqueeze(0), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```

## Training Details

### Training Data

HMDB-51 (CC BY 4.0, 6,766 clips across 51 classes). I stratify 70 / 15 / 15 % into train/val/test (4,736 / 1,015 / 1,015 clips).

### Training Procedure

| Setting          | Value                                        |
| ---------------- | -------------------------------------------- |
| Frozen layers    | all V-JEPA 2 backbone blocks                 |
| Trainable params | 1.2 M (classification head)                  |
| Epochs           | 5                                            |
| Effective batch  | 16 (physical 4 × grad-accum 4)               |
| Optimiser        | Adam (lr 1e-5)                               |
| Augmentations    | RandomResizedCrop 256², RandomHorizontalFlip |
| Hardware         | 1× NVIDIA A100 80 GB                         |

#### Preprocessing

Clips are sampled at 16 frames per video (torchcodec's `clips_at_random_indices`), resized/cropped to 256², then normalised by the processor.
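The snippet below is a minimal sketch of that sampling step, assuming torchcodec's `VideoDecoder` and `clips_at_random_indices` sampler; the `load_clip` helper and the file path are illustrative, and keyword names may differ slightly between torchcodec versions.

```python
# Hedged sketch of the preprocessing described above: decode one random 16-frame
# clip per video with torchcodec, then let the V-JEPA 2 processor resize/crop to
# 256² and normalise. The helper name and file path are illustrative only.
import torch
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

def load_clip(video_path: str) -> torch.Tensor:
    """Return one random 16-frame clip as a (T, C, H, W) uint8 tensor."""
    decoder = VideoDecoder(video_path)
    batch = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)
    return batch.data[0]  # drop the clip dimension -> (16, C, H, W)

frames = load_clip("brush_hair/example_clip.avi")             # illustrative HMDB-51 path
inputs = processor(frames.unsqueeze(0), return_tensors="pt")  # (B, T, C, H, W) inputs
```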
#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The held-out 15 % stratified test split of HMDB-51 (1,015 clips); see the evaluation protocol below.

#### Factors

[More Information Needed]

#### Metrics

| Metric | Definition | Why we use it |
| --- | --- | --- |
| **Top-1 accuracy** | Percentage of videos for which the *predicted* class label exactly matches the **single ground-truth action**. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
| *(optional)* Top-5 accuracy | A video counts as correct if the ground-truth label appears among the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. *run* vs *walk*), but **not reported here** to keep the head-only baseline in line with earlier papers. |

**Evaluation protocol.** The 70 / 15 / 15 % stratified split described under *Training Data* (based on HMDB-51 split 1) is used for validation during training and for final test reporting. We sample one 16-frame clip per video at 256 × 256 resolution and apply single-crop evaluation, following the upstream V-JEPA 2 model card. This produces a 5-D tensor (B, T, C, H, W) that the `VJEPA2VideoProcessor` converts to model inputs. Accuracy is averaged over the full validation or test set; no class weighting is applied.
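For concreteness, the sketch below shows roughly how this single-clip, single-crop Top-1 number can be recomputed. The `test_videos` list of `(path, label_id)` pairs and the `load_clip` helper from the preprocessing sketch above are assumptions, not part of the released code.

```python
# Hedged evaluation sketch: one 16-frame clip per video, single crop, Top-1 accuracy.
# Assumes `test_videos` is a list of (video_path, label_id) pairs and reuses the
# hypothetical `load_clip` helper defined in the preprocessing sketch above.
import torch
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

@torch.no_grad()
def top1_accuracy(test_videos):
    correct = 0
    for video_path, label_id in test_videos:
        clip = load_clip(video_path)                                 # (16, C, H, W)
        inputs = processor(clip.unsqueeze(0), return_tensors="pt")   # (B, T, C, H, W)
        pred = model(**inputs).logits.argmax(-1).item()
        correct += int(pred == label_id)
    return correct / len(test_videos)
```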
### Results

| Split | Epochs | Top-1 accuracy |
| ------------------------------- | :---: | :-----------------: |
| Validation | 1 → 5 | 14.2 % → **41.9 %** |
| Test (single-crop, single-clip) | — | **42.9 %** |

Numbers come from the run shown in the training logs (`runs/vjepa2_hmdb51`).

**How it compares**

| Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
| ------------------------------------ | ---------------- | --------------------- | ------------- |
| **This work – head-only JEPA-L** | 1 M (0.3 %) | 1 ✕ 1 | **42.9 %** |
| Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % ([arxiv.org][1]) |
| Linear probe TimeSformer-B-IN pt | full-frozen | 3 ✕ 10 | 42.9 % (val) ([github.com][2]) |
| **AdaptFormer** (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % ([proceedings.neurips.cc][3]) |
| **CVPT** visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
| Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % ([proceedings.neurips.cc][3]) |
| Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % ([arxiv.org][1]) |
| **VideoMAE V2-G (giant)** | 100 % | 3 ✕ 10 | 86 % ([arxiv.org][4]) |
| InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % ([researchgate.net][5]) |

[1]: https://arxiv.org/pdf/2203.12602 "VideoMAE: Masked Autoencoders are Data-Efficient ... - arXiv"
[2]: https://github.com/facebookresearch/TimeSformer/issues/19 "About UCF101 and HMDB51 results · Issue #19 - GitHub"
[3]: https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"
[4]: https://arxiv.org/html/2402.08875v4 "Advancing Human Action Recognition with Foundation Models ..."
[5]: https://www.researchgate.net/publication/392129870_InBRwSANet_Self-attention_based_parallel_inverted_residual_bottleneck_architecture_for_human_action_recognition_in_smart_cities "InBRwSANet: Self-attention based parallel inverted residual ..."

**Take-away.** 42–43 % is in the upper range of published "backbone-frozen" baselines; unlocking a few transformer blocks, adding LoRA or prompt adapters, or running a full fine-tune typically raises HMDB-51 accuracy into the 55–70 % bracket. See the *Bias, Risks, and Limitations* and *Recommendations* sections for caveats and upgrade suggestions.

#### Summary

Head-only fine-tuning of a frozen V-JEPA 2 ViT-L backbone reaches 42.9 % single-clip, single-crop Top-1 accuracy on HMDB-51 after five epochs.

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1× NVIDIA A100 80 GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

- ViT-Large backbone (≈307 M parameters) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4,096 → 51 classes).

### Compute Infrastructure

#### Hardware

1× NVIDIA A100 80 GB (see *Training Procedure*).

#### Software

PyTorch, transformers, torchvision and torchcodec (see *How to Get Started with the Model* and *Preprocessing*).

## Citation

**BibTeX:**

```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```

**APA:**

Shelar, S. (2025). *V-JEPA 2 ViT-L fine-tuned on HMDB-51* [Model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51

## Glossary

[More Information Needed]

## More Information

[More Information Needed]

## Model Card Authors

Sujit Shelar

## Model Card Contact

[More Information Needed]