---
library_name: transformers
tags: []
---

# Model Card for SujitShelar/vjepa2-vitl-fpc16-256-hmdb51

## Model Details

### Model Description

V-JEPA 2 is a self-supervised video backbone trained on over 1 million hours of internet video; Meta released checkpoints with a Something-Something V2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6,766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive Top-1 accuracy (see *Evaluation*).

- **Developed by:** Sujit Shelar
- **Funded by:** self-funded (personal compute credits)
- **Shared by:** Sujit Shelar
- **Model type:** V-JEPA 2 ViT-Large (16-frame, 256² input) video encoder with a 51-way classification head; vision-only, no text inputs
- **Language(s) (NLP):** not applicable (video-only model)
- **License:** MIT, identical to the upstream V-JEPA 2 weights
- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2

### Model Sources

- **Repository:** https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

### Direct Use

- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extraction for video retrieval or robotics perception pipelines.

### Downstream Use

- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101).

### Out-of-Scope Use

- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks: the model outputs only class logits.

## Bias, Risks, and Limitations

HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (6,766 clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. For higher accuracy, consider unfreezing a few backbone blocks, adding LoRA or prompt adapters, or running a full fine-tune (see the take-away under *Results*).

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id)

# Sample one ~5-second clip with torchvision.io or torchcodec, shape (T, C, H, W).
video = torch.randn(16, 3, 256, 256)  # dummy tensor standing in for real frames

inputs = processor(video.unsqueeze(0), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```

## Training Details

### Training Data

HMDB-51 (CC BY 4.0, 6,766 clips across 51 classes). I stratify 70 / 15 / 15 % into train/val/test (4,736 / 1,015 / 1,015 clips).

### Training Procedure

| Setting          | Value                                        |
| ---------------- | -------------------------------------------- |
| Frozen layers    | all V-JEPA 2 backbone blocks                 |
| Trainable params | 1.2 M (classification head)                  |
| Epochs           | 5                                            |
| Effective batch  | 16 (physical 4 × grad-accum 4)               |
| Optimiser        | Adam (lr 1e-5)                               |
| Augmentations    | RandomResizedCrop 256², RandomHorizontalFlip |
| Hardware         | 1× NVIDIA A100 80 GB                         |

#### Preprocessing

Clips are sampled at 16 frames per video (torchcodec's `clips_at_random_indices`), resized/cropped to 256², then normalised by the processor.
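The snippet below is a minimal sketch of that sampling step, assuming torchcodec's `VideoDecoder` and `clips_at_random_indices` sampler; the `load_clip` helper and the file path are illustrative, and keyword names may differ slightly between torchcodec versions.

```python
# Hedged sketch of the preprocessing described above: decode one random 16-frame
# clip per video with torchcodec, then let the V-JEPA 2 processor resize/crop to
# 256² and normalise. The helper name and file path are illustrative only.
import torch
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

def load_clip(video_path: str) -> torch.Tensor:
    """Return one random 16-frame clip as a (T, C, H, W) uint8 tensor."""
    decoder = VideoDecoder(video_path)
    batch = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)
    return batch.data[0]  # drop the clip dimension -> (16, C, H, W)

frames = load_clip("brush_hair/example_clip.avi")             # illustrative HMDB-51 path
inputs = processor(frames.unsqueeze(0), return_tensors="pt")  # (B, T, C, H, W) inputs
```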
#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The held-out 15 % stratified test split of HMDB-51 (1,015 clips); see the evaluation protocol below.

#### Factors

[More Information Needed]

#### Metrics

| Metric | Definition | Why we use it |
| --- | --- | --- |
| **Top-1 accuracy** | Percentage of videos for which the *predicted* class label exactly matches the **single ground-truth action**. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
| *(optional)* Top-5 accuracy | A video counts as correct if the ground-truth label appears among the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. *run* vs *walk*), but **not reported here** to keep the head-only baseline in line with earlier papers. |

**Evaluation protocol.** The 70 / 15 / 15 % stratified split described under *Training Data* (based on HMDB-51 split 1) is used for validation during training and for final test reporting. We sample one 16-frame clip per video at 256 × 256 resolution and apply single-crop evaluation, following the upstream V-JEPA 2 model card. This produces a 5-D tensor (B, T, C, H, W) that the `VJEPA2VideoProcessor` converts to model inputs. Accuracy is averaged over the full validation or test set; no class weighting is applied.
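For concreteness, the sketch below shows roughly how this single-clip, single-crop Top-1 number can be recomputed. The `test_videos` list of `(path, label_id)` pairs and the `load_clip` helper from the preprocessing sketch above are assumptions, not part of the released code.

```python
# Hedged evaluation sketch: one 16-frame clip per video, single crop, Top-1 accuracy.
# Assumes `test_videos` is a list of (video_path, label_id) pairs and reuses the
# hypothetical `load_clip` helper defined in the preprocessing sketch above.
import torch
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

@torch.no_grad()
def top1_accuracy(test_videos):
    correct = 0
    for video_path, label_id in test_videos:
        clip = load_clip(video_path)                                 # (16, C, H, W)
        inputs = processor(clip.unsqueeze(0), return_tensors="pt")   # (B, T, C, H, W)
        pred = model(**inputs).logits.argmax(-1).item()
        correct += int(pred == label_id)
    return correct / len(test_videos)
```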
### Results

| Split | Epochs | Top-1 accuracy |
| ------------------------------- | :---: | :-----------------: |
| Validation | 1 → 5 | 14.2 % → **41.9 %** |
| Test (single-crop, single-clip) | — | **42.9 %** |

Numbers come from the run shown in the training logs (`runs/vjepa2_hmdb51`).

**How it compares**

| Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
| ------------------------------------ | ---------------- | --------------------- | ------------- |
| **This work – head-only JEPA-L** | 1 M (0.3 %) | 1 ✕ 1 | **42.9 %** |
| Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % ([arxiv.org][1]) |
| Linear probe TimeSformer-B-IN pt | full-frozen | 3 ✕ 10 | 42.9 % (val) ([github.com][2]) |
| **AdaptFormer** (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % ([proceedings.neurips.cc][3]) |
| **CVPT** visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
| Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % ([proceedings.neurips.cc][3]) |
| Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % ([arxiv.org][1]) |
| **VideoMAE V2-G (giant)** | 100 % | 3 ✕ 10 | 86 % ([arxiv.org][4]) |
| InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % ([researchgate.net][5]) |

[1]: https://arxiv.org/pdf/2203.12602 "VideoMAE: Masked Autoencoders are Data-Efficient ... - arXiv"
[2]: https://github.com/facebookresearch/TimeSformer/issues/19 "About UCF101 and HMDB51 results · Issue #19 - GitHub"
[3]: https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"
[4]: https://arxiv.org/html/2402.08875v4 "Advancing Human Action Recognition with Foundation Models ..."
[5]: https://www.researchgate.net/publication/392129870_InBRwSANet_Self-attention_based_parallel_inverted_residual_bottleneck_architecture_for_human_action_recognition_in_smart_cities "InBRwSANet: Self-attention based parallel inverted residual ..."

**Take-away.** 42–43 % is in the upper range of published "backbone-frozen" baselines; unlocking a few transformer blocks, adding LoRA or prompt adapters, or running a full fine-tune typically raises HMDB-51 accuracy into the 55–70 % bracket. See the *Bias, Risks, and Limitations* and *Recommendations* sections for caveats and upgrade suggestions.

#### Summary

Head-only fine-tuning of a frozen V-JEPA 2 ViT-L backbone reaches 42.9 % single-clip, single-crop Top-1 accuracy on HMDB-51 after five epochs.

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1× NVIDIA A100 80 GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

- ViT-Large backbone (≈307 M parameters) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4,096 → 51 classes).

### Compute Infrastructure

#### Hardware

1× NVIDIA A100 80 GB (see *Training Procedure*).

#### Software

PyTorch, transformers, torchvision and torchcodec (see *How to Get Started with the Model* and *Preprocessing*).

## Citation

**BibTeX:**

```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```

**APA:**

Shelar, S. (2025). *V-JEPA 2 ViT-L fine-tuned on HMDB-51* [Model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51

## Glossary

[More Information Needed]

## More Information

[More Information Needed]

## Model Card Authors

Sujit Shelar

## Model Card Contact

[More Information Needed]