---
library_name: transformers
tags: []
---
# Model Card for vjepa2-vitl-fpc16-256-hmdb51
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
V-JEPA 2 is a self-supervised video backbone trained on more than 1 million hours of internet video; Meta released checkpoints with a Something-Something v2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive top-1 accuracy (see Evaluation).
- **Developed by:** Sujit Shelar
- **Funded by:** self-funded (personal compute credits)
- **Shared by:** Sujit Shelar
- **Model type:** V-JEPA 2 ViT-Large (16-frame, 256² input) video encoder with a 51-way classification head; vision-only (video), no text inputs
- **Language(s) (NLP):** Not applicable (video-only model)
- **License:** MIT (identical to the upstream V-JEPA 2 weights)
- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extraction for video retrieval or robotics perception pipelines (see the sketch below).
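A minimal sketch of using the checkpoint as a clip-level feature extractor. It assumes the classification model returns `hidden_states` when `output_hidden_states=True` is passed; the mean-pooling over tokens is an illustrative choice, not part of the released training code:

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

video = torch.randn(16, 3, 256, 256)  # dummy clip, (T, C, H, W)
inputs = processor(video.unsqueeze(0), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last-layer tokens into a single clip embedding for retrieval.
clip_embedding = outputs.hidden_states[-1].mean(dim=1)  # (1, hidden_dim)
print(clip_embedding.shape)
```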
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101); a sketch of swapping in a new classification head follows below.
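As a hedged sketch (not a tested recipe), the 51-way head can be replaced for a new label set by reloading with `num_labels` and `ignore_mismatched_sizes`; the 101-class value below is only an example matching UCF-101, and the head parameter names in the freezing loop are assumptions to verify against `model.named_parameters()`:

```python
from transformers import AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"

# Reload with a freshly initialised 101-way head (UCF-101 example);
# the mismatched 51-way head weights are discarded.
model = AutoModelForVideoClassification.from_pretrained(
    model_id,
    num_labels=101,
    ignore_mismatched_sizes=True,
)

# Freeze the backbone again so only the new head trains, mirroring this card's recipe.
# The "classifier"/"pooler" name check is an assumption; inspect the model to confirm.
for name, param in model.named_parameters():
    if "classifier" not in name and "pooler" not in name:
        param.requires_grad = False
```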
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks: the model outputs only class logits.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (≈6 k clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoVideoProcessor, AutoModelForVideoClassification
import torch

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

# Sample one 16-frame clip with torchvision.io or torchcodec, shape (T, C, H, W);
# a random tensor stands in for a real video here.
video = torch.randn(16, 3, 256, 256)  # dummy tensor

inputs = processor(video.unsqueeze(0), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
HMDB-51 (CC BY 4.0; 6 766 clips across 51 classes). I use a stratified 70 / 15 / 15 % train/val/test split (4 736 / 1 015 / 1 015 clips); a sketch of the split is shown below.
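A minimal sketch of producing such a stratified split with scikit-learn, assuming a list of `(clip_path, label)` pairs; the helper name and seed are illustrative, not taken from the training code:

```python
from sklearn.model_selection import train_test_split

def stratified_split(clips, seed=42):
    """clips: list of (path, label) pairs gathered from the HMDB-51 directory tree."""
    paths, labels = zip(*clips)
    # 70 % train, 30 % held out, stratified by action class
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # split the held-out 30 % evenly into 15 % val / 15 % test
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)
```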
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
| Setting          | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Frozen layers    | all V-JEPA 2 backbone blocks                             |
| Trainable params | 1.2 M (classification head)                              |
| Epochs           | 5                                                        |
| Effective batch  | 16 (physical 4 × grad-accum 4)                           |
| Optimiser        | Adam (lr 1e-5)                                           |
| Augmentations    | RandomResizedCrop 256², RandomHorizontalFlip             |
| Hardware         | 1× NVIDIA A100 80 GB                                     |
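The following sketch shows how this head-only recipe could be set up with the Hugging Face Trainer; only the hyperparameters mirror the table above, while the head module names are assumptions to confirm against the checkpoint:

```python
from transformers import AutoModelForVideoClassification, TrainingArguments

# Start from the SSv2 checkpoint and swap in a fresh 51-way head.
model = AutoModelForVideoClassification.from_pretrained(
    "facebook/vjepa2-vitl-fpc16-256-ssv2",
    num_labels=51,
    ignore_mismatched_sizes=True,
)

# Freeze every backbone block; the "classifier"/"pooler" name check is an assumption.
for name, param in model.named_parameters():
    if "classifier" not in name and "pooler" not in name:
        param.requires_grad = False

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")

args = TrainingArguments(
    output_dir="runs/vjepa2_hmdb51",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch 16
    learning_rate=1e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="none",
)
```

These arguments are then passed to `transformers.Trainer` together with train/val datasets that yield the processor's pixel values and integer labels. Note that `Trainer` defaults to AdamW, whereas the run summarised above used plain Adam at the same learning rate.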
#### Preprocessing
Clips are sampled at 16 frames per video (`torchcodec` `clips_at_random_indices`), resized/cropped to 256², then normalised by the processor; see the sketch below.
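A sketch of that sampling step; the exact `torchcodec` keyword names should be checked against the installed version, and the file path is a placeholder:

```python
from transformers import AutoVideoProcessor
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

decoder = VideoDecoder("some_hmdb51_clip.avi")  # placeholder path

# One random 16-frame clip; returns a FrameBatch whose .data is a
# uint8 tensor of shape (num_clips, num_frames, C, H, W).
clip = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)

# The processor handles the resize/crop to 256² and the normalisation.
inputs = processor(clip.data, return_tensors="pt")
```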
#### Training Hyperparameters
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
The held-out 15 % test split of HMDB-51 (1 015 clips); see Training Data above.
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
| Metric | Definition | Why we use it |
| --------------------------- | -------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Top-1 accuracy** | Percentage of videos for which the *predicted* class label exactly matches the **single ground-truth action**. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
| *(optional)* Top-5 accuracy | Video is considered correct if the ground-truth label appears in the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. *run* vs *walk*), but **not reported here** to keep the head-only baseline in line with earlier papers. |
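For concreteness, a minimal sketch of computing these two metrics from collected logits; this is illustrative code, not the evaluation script behind the numbers below:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """logits: (N, 51) class scores; labels: (N,) ground-truth class indices."""
    topk = logits.topk(k, dim=-1).indices           # (N, k) highest-scoring classes
    hits = (topk == labels.unsqueeze(-1)).any(-1)   # correct if label appears in top-k
    return hits.float().mean().item()

# top1 = topk_accuracy(all_logits, all_labels, k=1)
# top5 = topk_accuracy(all_logits, all_labels, k=5)
```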
#### Evaluation protocol
- Split-1 of HMDB-51 (the canonical 70 / 15 / 15 % stratified split) is used for validation during training and for final test reporting.
- I sample one 16-frame clip per video at 256 × 256 resolution and apply single-crop evaluation, following the V-JEPA 2 model card. This produces a 5-D tensor (B, T, C, H, W) that the `VJEPA2VideoProcessor` converts to model inputs.
- Accuracy is averaged over the full validation or test set; no class weighting is applied. A sketch of this loop follows.
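A sketch of the single-clip, single-crop evaluation loop under those assumptions; the `test_loader` yielding raw clips and integer labels is hypothetical:

```python
import torch

@torch.no_grad()
def evaluate(model, processor, test_loader, device="cuda"):
    """Unweighted top-1 accuracy over a loader of (clips, labels) batches."""
    model.eval().to(device)
    correct = total = 0
    for clips, labels in test_loader:   # clips: (B, 16, 3, H, W) raw frames
        inputs = processor(clips, return_tensors="pt").to(device)
        preds = model(**inputs).logits.argmax(-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```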
### Results
| Split | Epochs | Top-1 accuracy |
| ------------------------------- | :----: | :-----------------: |
| Validation | 1 → 5 | 14.2 % → **41.9 %** |
| Test (single-crop, single-clip) | — | **42.9 %** |
<sub>Numbers come from the run shown in the training logs (runs/vjepa2_hmdb51).</sub>
**How it compares**
| Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
| ------------------------------------- | ---------------- | --------------------- | ------------------------------------ |
| **This work – head-only JEPA-L** | 1 M (0.3 %) | 1 ✕ 1 | **42.9 %** |
| Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % ([arxiv.org][1]) |
| Linear probe TimeSformer-B-IN pt | full-frozen | 3 ✕ 10 | 42.9 % (val) ([github.com][2]) |
| **AdaptFormer** (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % ([proceedings.neurips.cc][3]) |
| **CVPT** visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
| Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % ([proceedings.neurips.cc][3]) |
| Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % ([arxiv.org][1]) |
| **VideoMAE V2-G (giant)** | 100 % | 3 ✕ 10 | 86 % ([arxiv.org][4]) |
| InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % ([researchgate.net][5]) |
[1]: https://arxiv.org/pdf/2203.12602 "[PDF] VideoMAE: Masked Autoencoders are Data-Efficient ... - arXiv"
[2]: https://github.com/facebookresearch/TimeSformer/issues/19 "About UCF101 and HMDB51 results · Issue #19 - GitHub"
[3]: https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf "[PDF] Adapting Vision Transformers for Scalable Visual Recognition"
[4]: https://arxiv.org/html/2402.08875v4 "Advancing Human Action Recognition with Foundation Models ..."
[5]: https://www.researchgate.net/publication/392129870_InBRwSANet_Self-attention_based_parallel_inverted_residual_bottleneck_architecture_for_human_action_recognition_in_smart_cities "(PDF) InBRwSANet: Self-attention based parallel inverted residual ..."
**Take-away**
42–43 % is in the upper range of published “backbone-frozen” baselines; unlocking a few transformer blocks, adding LoRA / prompt adapters (sketched below), or running a full fine-tune typically raises HMDB-51 accuracy into the 55–70 % bracket. See the Bias, Risks & Limitations and Recommendations sections for caveats and upgrade suggestions.
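One possible upgrade path is parameter-efficient adaptation; the sketch below applies LoRA via the `peft` library. The target-module and head names are assumptions to verify against the actual V-JEPA 2 module names, and the rank/alpha values are only examples:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVideoClassification

model = AutoModelForVideoClassification.from_pretrained(
    "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
)

# Target-module names are an assumption; list the attention projections via
# [n for n, _ in model.named_modules()] and adjust accordingly.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],
    modules_to_save=["classifier"],  # keep the task head fully trainable (name assumed)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```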
#### Summary
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications
### Model Architecture and Objective
- ViT-Large backbone (307 M params) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4 096 → 51 classes); a sketch follows below.
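A stand-alone sketch of what such a head looks like; the 1 024-dim input (the standard ViT-Large width) and the layer layout are assumptions based on the description above, not the checkpoint's actual module names:

```python
import torch.nn as nn

# Hypothetical stand-in for the classification head described above:
# two MLP layers mapping pooled V-JEPA 2 features to 51 HMDB-51 logits.
head = nn.Sequential(
    nn.Linear(1024, 4096),  # 1024-dim ViT-Large embedding is an assumption
    nn.GELU(),
    nn.Linear(4096, 51),
)
```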
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```
**APA:**
Shelar, S. (2025). *V-JEPA2 ViT-L fine-tuned on HMDB-51* [Model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
Sujit Shelar
## Model Card Contact
[More Information Needed]