|
--- |
|
library_name: transformers |
|
tags:
- video-classification
- action-recognition
- hmdb-51
- vjepa2
|
--- |
|
|
|
# Model Card for vjepa2-vitl-fpc16-256-hmdb51
|
|
|
<!-- Provide a quick summary of what the model is/does. -->

V-JEPA 2 ViT-Large video encoder with a frozen backbone and a 51-way classification head, fine-tuned for human-action recognition on HMDB-51.
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
V-JEPA 2 is a self-supervised video backbone trained on over 1 million hours of internet video; Meta released checkpoints with a Something-Something v2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6,766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive top-1 accuracy among backbone-frozen baselines (see Evaluation).
|
|
|
- **Developed by:** Sujit Shelar

- **Funded by:** self-funded (personal compute credits)

- **Shared by:** Sujit Shelar

- **Model type:** V-JEPA 2 ViT-Large video encoder (16-frame clips, 256² input) with a 51-way classification head; vision-only (video), no text inputs

- **Language(s) (NLP):** not applicable (video-only model)

- **License:** MIT, identical to the upstream V-JEPA 2 weights

- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2
|
|
|
### Model Sources [optional] |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51](https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51)
|
- **Paper [optional]:** [More Information Needed] |
|
- **Demo [optional]:** [More Information Needed] |
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
|
- Rapid benchmarking or research on human-action recognition in academic settings.

- Feature extractor for video retrieval or robotics perception pipelines.
|
|
|
### Downstream Use [optional] |
|
|
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
|
|
Starting point for further fine-tuning on custom action datasets (e.g. UCF-101). |
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
|
|
- Any safety-critical decision-making (medical, legal, real-time surveillance).

- Generation or captioning tasks: the model outputs only class logits.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (≈6,800 clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.
|
|
|
### Recommendations |
|
|
|
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Before deployment in a new domain, validate on in-domain data; for higher accuracy, consider unfreezing backbone blocks or adding parameter-efficient adapters (see the Evaluation take-away).
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
from transformers import AutoVideoProcessor, AutoModelForVideoClassification
import torch

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id)

# Sample one ~5-second clip as 16 frames with torchcodec or torchvision.io,
# giving a tensor of shape (T, C, H, W).
video = torch.randn(16, 3, 256, 256)  # dummy clip; replace with real decoded frames

inputs = processor(video.unsqueeze(0), return_tensors="pt")  # add batch dim -> (B, T, C, H, W)
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
HMDB-51 (CC BY 4.0; 6,766 clips across 51 classes). I stratify 70/15/15 % into train/val/test (4,736 / 1,015 / 1,015 clips).
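
A minimal sketch of producing such a stratified 70/15/15 split. This is an illustration rather than the exact script used: scikit-learn's `train_test_split` is an assumed choice, and `clip_paths` / `clip_labels` are placeholder stand-ins for the real HMDB-51 file list.

```python
# Hypothetical stratified 70/15/15 split with scikit-learn.
# `clip_paths` and `clip_labels` are placeholders for the real HMDB-51 file list.
from sklearn.model_selection import train_test_split

clip_paths = [f"clip_{i:04d}.avi" for i in range(6766)]   # placeholder paths
clip_labels = [i % 51 for i in range(6766)]               # placeholder labels

train_p, rest_p, train_y, rest_y = train_test_split(
    clip_paths, clip_labels, test_size=0.30, stratify=clip_labels, random_state=0
)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=0
)
print(len(train_p), len(val_p), len(test_p))   # ≈ 4736 / 1015 / 1015
```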
|
|
|
### Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
| Setting          | Value                                        |
| ---------------- | -------------------------------------------- |
| Frozen layers    | all V-JEPA 2 backbone blocks                 |
| Trainable params | 1.2 M (classification head)                  |
| Epochs           | 5                                            |
| Effective batch  | 16 (physical 4 × grad-accum 4)               |
| Optimiser        | Adam (lr 1e-5)                               |
| Augmentations    | RandomResizedCrop 256², RandomHorizontalFlip |
| Hardware         | 1× NVIDIA A100 80 GB                         |
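
The sketch below reconstructs that recipe (frozen backbone, head-only Adam at lr 1e-5, gradient accumulation of 4). It is illustrative rather than the exact training script: the `num_labels`/`ignore_mismatched_sizes` head swap, the `"classifier"`/`"pooler"` name filter and the dummy batches are assumptions, and the crop/flip augmentations and real dataloading are omitted.

```python
import torch
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

base_id = "facebook/vjepa2-vitl-fpc16-256-ssv2"
processor = AutoVideoProcessor.from_pretrained(base_id)
model = AutoModelForVideoClassification.from_pretrained(
    base_id,
    num_labels=51,                 # HMDB-51 classes
    ignore_mismatched_sizes=True,  # drop the SSv2 head, initialise a fresh 51-way head
)

# Freeze the backbone; train only the classification head.
# (Exact submodule names can differ between transformers versions; inspect
#  model.named_parameters() if this filter matches nothing.)
for name, param in model.named_parameters():
    param.requires_grad = ("classifier" in name) or ("pooler" in name)

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

accum_steps = 4  # physical batch 4 × grad-accum 4 = effective batch 16
model.train()

# Dummy batches so the sketch runs as a smoke test; replace with a real
# HMDB-51 DataLoader yielding (list of (T, C, H, W) clips, label tensor).
batches = [
    ([torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8) for _ in range(4)],
     torch.randint(0, 51, (4,)))
    for _ in range(accum_steps)
]

for step, (clips, labels) in enumerate(batches):
    inputs = processor(clips, return_tensors="pt")
    loss = model(**inputs, labels=labels).loss
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```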
|
|
|
|
|
#### Preprocessing |
|
|
|
Clips are sampled at 16 frames per video using torchcodec's clips_at_random_indices sampler, resized/cropped to 256², then normalised by the processor.
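
A minimal loading sketch under those assumptions; `example_clip.avi` is a hypothetical path, and evenly spaced frame indices stand in here for the random clip sampling used during training.

```python
# Decode one HMDB-51 clip with torchcodec and prepare model inputs.
import torch
from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

decoder = VideoDecoder("example_clip.avi")            # hypothetical clip path
num_frames = decoder.metadata.num_frames
indices = torch.linspace(0, num_frames - 1, steps=16).long().tolist()

video = decoder.get_frames_at(indices=indices).data   # (T, C, H, W) uint8 frames
inputs = processor(video, return_tensors="pt")        # resize/crop to 256², normalise
```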
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision --> |
|
|
|
#### Speeds, Sizes, Times [optional] |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
[More Information Needed] |
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
The 15 % held-out test split of HMDB-51 (1,015 clips), drawn from the stratified split described in Training Data.
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
| Metric                      | Definition                                                                                                      | Why we use it                                                                                                                                                                 |
| --------------------------- | --------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Top-1 accuracy**          | Percentage of videos for which the *predicted* class label exactly matches the **single ground-truth action**.  | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work.                                        |
| *(optional)* Top-5 accuracy | Video is considered correct if the ground-truth label appears in the five highest-probability classes.          | Helpful when the correct class is semantically close to others (e.g. *run* vs *walk*), but **not reported here** to keep the head-only baseline in line with earlier papers.    |
|
|
|
**Evaluation protocol**
|
|
|
A single stratified 70/15/15 % split (described in Training Data) is used for validation during training and for final test reporting.
|
|
|
We sample one 16-frame clip per video at 256 × 256 resolution and apply single-crop evaluation, following the V-JEPA 2 model card. This produces a 5-D tensor (B, T, C, H, W) that the VJEPA2VideoProcessor converts to model inputs.
|
|
|
Accuracy is averaged over the full validation or test set; no class weighting is applied. |
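
A compact sketch of that protocol as code; `test_videos` and `test_labels` are hypothetical placeholders for the decoded test clips and their integer labels.

```python
import torch

@torch.no_grad()
def top1_accuracy(model, processor, test_videos, test_labels):
    """Single-clip, single-crop Top-1 accuracy, unweighted over all clips."""
    model.eval()
    correct = 0
    for video, label in zip(test_videos, test_labels):   # video: (T, C, H, W) tensor
        inputs = processor(video, return_tensors="pt")
        pred = model(**inputs).logits.argmax(-1).item()
        correct += int(pred == label)
    return correct / len(test_labels)
```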
|
|
|
### Results |
|
|
|
| Split                           | Epochs | Top-1 accuracy      |
| ------------------------------- | :----: | :-----------------: |
| Validation                      | 1 → 5  | 14.2 % → **41.9 %** |
| Test (single-crop, single-clip) |   —    | **42.9 %**          |
|
|
|
<sub>Numbers come from the run shown in the training logs (runs/vjepa2_hmdb51).</sub> |
|
|
|
**How it compares**

| Method (ViT-L backbone unless noted)  | Trainable params | Clips / crops at test | HMDB-51 Top-1                        |
| ------------------------------------- | ---------------- | --------------------- | ------------------------------------ |
| **This work – head-only JEPA-L**      | 1 M (0.3 %)      | 1 ✕ 1                 | **42.9 %**                           |
| Linear probe VideoMAE-B               | 0.1 %            | 1 ✕ 1                 | 38.9 % ([arxiv.org][1])              |
| Linear probe TimeSformer-B-IN pt      | full-frozen      | 3 ✕ 10                | 42.9 % (val) ([github.com][2])       |
| **AdaptFormer** (last-block adapters) | 1 %              | 1 ✕ 1                 | 46.1 % ([proceedings.neurips.cc][3]) |
| **CVPT** visual-prompt tuning         | <1 %             | 3 ✕ 10                | 57 %                                 |
| Full fine-tune TimeSformer-B          | 100 %            | 3 ✕ 10                | 64 % ([proceedings.neurips.cc][3])   |
| Full fine-tune VideoMAE-B             | 100 %            | 3 ✕ 10                | 73 % ([arxiv.org][1])                |
| **VideoMAE V2-G (giant)**             | 100 %            | 3 ✕ 10                | 86 % ([arxiv.org][4])                |
| InBRwSANet (CNN + SA)                 | 100 %            | 3 ✕ 10                | 77 % ([researchgate.net][5])         |
|
|
|
[1]: https://arxiv.org/pdf/2203.12602 "[PDF] VideoMAE: Masked Autoencoders are Data-Efficient ... - arXiv" |
|
[2]: https://github.com/facebookresearch/TimeSformer/issues/19 "About UCF101 and HMDB51 results · Issue #19 - GitHub" |
|
[3]: https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf "[PDF] Adapting Vision Transformers for Scalable Visual Recognition" |
|
[4]: https://arxiv.org/html/2402.08875v4 "Advancing Human Action Recognition with Foundation Models ..." |
|
[5]: https://www.researchgate.net/publication/392129870_InBRwSANet_Self-attention_based_parallel_inverted_residual_bottleneck_architecture_for_human_action_recognition_in_smart_cities "(PDF) InBRwSANet: Self-attention based parallel inverted residual ..." |
|
|
|
|
|
**Take-away**

42–43 % top-1 is in the upper range of published “backbone-frozen” baselines; unlocking a few transformer blocks, adding LoRA / prompt adapters, or running a full fine-tune typically raises HMDB-51 accuracy into the 55–70 % bracket. See the Bias, Risks, and Limitations and Recommendations sections for caveats and upgrade suggestions.
|
|
|
#### Summary

Freezing the V-JEPA 2 ViT-L backbone and training only the 51-way classification head (≈1.2 M parameters) for 5 epochs reaches 42.9 % top-1 on the held-out HMDB-51 test split, placing this checkpoint in the upper range of backbone-frozen baselines.
|
|
|
|
|
|
|
## Model Examination [optional] |
|
|
|
<!-- Relevant interpretability work for the model goes here --> |
|
|
|
[More Information Needed] |
|
|
|
## Environmental Impact |
|
|
|
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** 1× NVIDIA A100 80 GB
|
- **Hours used:** [More Information Needed] |
|
- **Cloud Provider:** [More Information Needed] |
|
- **Compute Region:** [More Information Needed] |
|
- **Carbon Emitted:** [More Information Needed] |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- ViT-Large backbone (≈307 M parameters) within the V-JEPA 2 framework.

- 16 × 16 image patches over 256² input; 16-frame temporal tube.

- Classification head: two MLP layers (hidden 4,096 → 51 classes).
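
A quick sanity check of the parameter budget quoted above; the `"classifier"` name filter is an assumption about the head's parameter names in the current transformers implementation.

```python
# Count total vs classification-head parameters in the released checkpoint.
# Inspect model.named_parameters() if the "classifier" filter matches nothing.
from transformers import AutoModelForVideoClassification

model = AutoModelForVideoClassification.from_pretrained(
    "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
)
total = sum(p.numel() for p in model.parameters())
head = sum(p.numel() for n, p in model.named_parameters() if "classifier" in n)
print(f"total: {total / 1e6:.1f} M params, classifier head: {head / 1e6:.2f} M params")
```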
|
|
|
### Compute Infrastructure |
|
|
|
[More Information Needed] |
|
|
|
#### Hardware |
|
|
|
1× NVIDIA A100 80 GB (single-GPU training; see Training Procedure).
|
|
|
#### Software |
|
|
|
PyTorch, 🤗 Transformers (AutoVideoProcessor / AutoModelForVideoClassification) and torchcodec for video decoding; exact versions were not recorded.
|
|
|
## Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```
|
|
|
|
|
**APA:** |
|
|
|
Shelar, S. (2025). *V-JEPA2 ViT-L fine-tuned on HMDB-51* [Fine-tuned model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
|
|
|
## Glossary [optional] |
|
|
|
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. --> |
|
|
|
[More Information Needed] |
|
|
|
## More Information [optional] |
|
|
|
[More Information Needed] |
|
|
|
## Model Card Authors [optional] |
|
|
|
Sujit Shelar
|
|
|
## Model Card Contact |
|
|
|
[More Information Needed] |