---
library_name: transformers
tags: []
---
# Model Card for vjepa2-vitl-fpc16-256-hmdb51
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
V-JEPA 2 is a self-supervised video backbone trained on more than 1 million hours of internet video; Meta released checkpoints with a Something-Something v2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive top-1 accuracy (see Evaluation).
- **Developed by:** Sujit Shelar
- **Funded by:** self-funded (personal compute credits)
- **Shared by:** Sujit Shelar
- **Model type:** V-JEPA 2 ViT-Large (16-frame, 256² input) video encoder with a 51-way classification head; vision-only (video), no text inputs
- **Language(s) (NLP):** Not applicable (video-only model)
- **License:** MIT (identical to the upstream V-JEPA 2 weights)
- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extraction for video retrieval or robotics perception pipelines (see the sketch below).
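A minimal sketch of using the checkpoint as a clip-level feature extractor. It assumes the classification model returns `hidden_states` when `output_hidden_states=True` is passed; the mean-pooling over tokens is an illustrative choice, not part of the released training code:

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

video = torch.randn(16, 3, 256, 256)  # dummy clip, (T, C, H, W)
inputs = processor(video.unsqueeze(0), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last-layer tokens into a single clip embedding for retrieval.
clip_embedding = outputs.hidden_states[-1].mean(dim=1)  # (1, hidden_dim)
print(clip_embedding.shape)
```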
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101); a sketch of swapping in a new classification head follows below.
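As a hedged sketch (not a tested recipe), the 51-way head can be replaced for a new label set by reloading with `num_labels` and `ignore_mismatched_sizes`; the 101-class value below is only an example matching UCF-101, and the head parameter names in the freezing loop are assumptions to verify against `model.named_parameters()`:

```python
from transformers import AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"

# Reload with a freshly initialised 101-way head (UCF-101 example);
# the mismatched 51-way head weights are discarded.
model = AutoModelForVideoClassification.from_pretrained(
    model_id,
    num_labels=101,
    ignore_mismatched_sizes=True,
)

# Freeze the backbone again so only the new head trains, mirroring this card's recipe.
# The "classifier"/"pooler" name check is an assumption; inspect the model to confirm.
for name, param in model.named_parameters():
    if "classifier" not in name and "pooler" not in name:
        param.requires_grad = False
```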
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks: the model outputs only class logits.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (≈6 k clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoVideoProcessor, AutoModelForVideoClassification
import torch

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id).eval()

# Sample one 16-frame clip with torchvision.io or torchcodec, shape (T, C, H, W);
# a random tensor stands in for a real video here.
video = torch.randn(16, 3, 256, 256)  # dummy tensor

inputs = processor(video.unsqueeze(0), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
HMDB-51 (CC BY 4.0; 6 766 clips across 51 classes). I use a stratified 70 / 15 / 15 % train/val/test split (4 736 / 1 015 / 1 015 clips); a sketch of the split is shown below.
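A minimal sketch of producing such a stratified split with scikit-learn, assuming a list of `(clip_path, label)` pairs; the helper name and seed are illustrative, not taken from the training code:

```python
from sklearn.model_selection import train_test_split

def stratified_split(clips, seed=42):
    """clips: list of (path, label) pairs gathered from the HMDB-51 directory tree."""
    paths, labels = zip(*clips)
    # 70 % train, 30 % held out, stratified by action class
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # split the held-out 30 % evenly into 15 % val / 15 % test
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)
```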
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
| Setting          | Value                                                   |
| ---------------- | ------------------------------------------------------- |
| Frozen layers    | all V-JEPA 2 backbone blocks                             |
| Trainable params | 1.2 M (classification head)                              |
| Epochs           | 5                                                        |
| Effective batch  | 16 (physical 4 × grad-accum 4)                           |
| Optimiser        | Adam (lr 1e-5)                                           |
| Augmentations    | RandomResizedCrop 256², RandomHorizontalFlip             |
| Hardware         | 1× NVIDIA A100 80 GB                                     |
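The following sketch shows how this head-only recipe could be set up with the Hugging Face Trainer; only the hyperparameters mirror the table above, while the head module names are assumptions to confirm against the checkpoint:

```python
from transformers import AutoModelForVideoClassification, TrainingArguments

# Start from the SSv2 checkpoint and swap in a fresh 51-way head.
model = AutoModelForVideoClassification.from_pretrained(
    "facebook/vjepa2-vitl-fpc16-256-ssv2",
    num_labels=51,
    ignore_mismatched_sizes=True,
)

# Freeze every backbone block; the "classifier"/"pooler" name check is an assumption.
for name, param in model.named_parameters():
    if "classifier" not in name and "pooler" not in name:
        param.requires_grad = False

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")

args = TrainingArguments(
    output_dir="runs/vjepa2_hmdb51",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch 16
    learning_rate=1e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="none",
)
```

These arguments are then passed to `transformers.Trainer` together with train/val datasets that yield the processor's pixel values and integer labels. Note that `Trainer` defaults to AdamW, whereas the run summarised above used plain Adam at the same learning rate.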
#### Preprocessing
Clips are sampled at 16 frames per video (`torchcodec` `clips_at_random_indices`), resized/cropped to 256², then normalised by the processor; see the sketch below.
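A sketch of that sampling step; the exact `torchcodec` keyword names should be checked against the installed version, and the file path is a placeholder:

```python
from transformers import AutoVideoProcessor
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

decoder = VideoDecoder("some_hmdb51_clip.avi")  # placeholder path

# One random 16-frame clip; returns a FrameBatch whose .data is a
# uint8 tensor of shape (num_clips, num_frames, C, H, W).
clip = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)

# The processor handles the resize/crop to 256² and the normalisation.
inputs = processor(clip.data, return_tensors="pt")
```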
#### Training Hyperparameters
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
The held-out 15 % test split of HMDB-51 (1 015 clips); see Training Data above.
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
| Metric | Definition | Why we use it |
| --------------------------- | -------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Top-1 accuracy** | Percentage of videos for which the *predicted* class label exactly matches the **single ground-truth action**. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
| *(optional)* Top-5 accuracy | Video is considered correct if the ground-truth label appears in the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. *run* vs *walk*), but **not reported here** to keep the head-only baseline in line with earlier papers. |
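For concreteness, a minimal sketch of computing these two metrics from collected logits; this is illustrative code, not the evaluation script behind the numbers below:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """logits: (N, 51) class scores; labels: (N,) ground-truth class indices."""
    topk = logits.topk(k, dim=-1).indices           # (N, k) highest-scoring classes
    hits = (topk == labels.unsqueeze(-1)).any(-1)   # correct if label appears in top-k
    return hits.float().mean().item()

# top1 = topk_accuracy(all_logits, all_labels, k=1)
# top5 = topk_accuracy(all_logits, all_labels, k=5)
```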
#### Evaluation protocol
- Split-1 of HMDB-51 (the canonical 70 / 15 / 15 % stratified split) is used for validation during training and for final test reporting.
- I sample one 16-frame clip per video at 256 × 256 resolution and apply single-crop evaluation, following the V-JEPA 2 model card. This produces a 5-D tensor (B, T, C, H, W) that the `VJEPA2VideoProcessor` converts to model inputs.
- Accuracy is averaged over the full validation or test set; no class weighting is applied. A sketch of this loop follows.
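A sketch of the single-clip, single-crop evaluation loop under those assumptions; the `test_loader` yielding raw clips and integer labels is hypothetical:

```python
import torch

@torch.no_grad()
def evaluate(model, processor, test_loader, device="cuda"):
    """Unweighted top-1 accuracy over a loader of (clips, labels) batches."""
    model.eval().to(device)
    correct = total = 0
    for clips, labels in test_loader:   # clips: (B, 16, 3, H, W) raw frames
        inputs = processor(clips, return_tensors="pt").to(device)
        preds = model(**inputs).logits.argmax(-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```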
### Results
| Split | Epochs | Top-1 accuracy |
| ------------------------------- | :----: | :-----------------: |
| Validation | 1 → 5 | 14.2 % → **41.9 %** |
| Test (single-crop, single-clip) | — | **42.9 %** |
<sub>Numbers come from the run shown in the training logs (runs/vjepa2_hmdb51).</sub>
**How it compares**
| Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
| ------------------------------------- | ---------------- | --------------------- | ------------------------------------ |
| **This work – head-only JEPA-L** | 1 M (0.3 %) | 1 ✕ 1 | **42.9 %** |
| Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % ([arxiv.org][1]) |
| Linear probe TimeSformer-B-IN pt | full-frozen | 3 ✕ 10 | 42.9 % (val) ([github.com][2]) |
| **AdaptFormer** (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % ([proceedings.neurips.cc][3]) |
| **CVPT** visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
| Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % ([proceedings.neurips.cc][3]) |
| Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % ([arxiv.org][1]) |
| **VideoMAE V2-G (giant)** | 100 % | 3 ✕ 10 | 86 % ([arxiv.org][4]) |
| InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % ([researchgate.net][5]) |
[1]: https://arxiv.org/pdf/2203.12602 "[PDF] VideoMAE: Masked Autoencoders are Data-Efficient ... - arXiv"
[2]: https://github.com/facebookresearch/TimeSformer/issues/19 "About UCF101 and HMDB51 results · Issue #19 - GitHub"
[3]: https://proceedings.neurips.cc/paper_files/paper/2022/file/69e2f49ab0837b71b0e0cb7c555990f8-Paper-Conference.pdf "[PDF] Adapting Vision Transformers for Scalable Visual Recognition"
[4]: https://arxiv.org/html/2402.08875v4 "Advancing Human Action Recognition with Foundation Models ..."
[5]: https://www.researchgate.net/publication/392129870_InBRwSANet_Self-attention_based_parallel_inverted_residual_bottleneck_architecture_for_human_action_recognition_in_smart_cities "(PDF) InBRwSANet: Self-attention based parallel inverted residual ..."
**Take-away**
42–43 % is in the upper range of published “backbone-frozen” baselines; unlocking a few transformer blocks, adding LoRA / prompt adapters (sketched below), or running a full fine-tune typically raises HMDB-51 accuracy into the 55–70 % bracket. See the Bias, Risks & Limitations and Recommendations sections for caveats and upgrade suggestions.
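One possible upgrade path is parameter-efficient adaptation; the sketch below applies LoRA via the `peft` library. The target-module and head names are assumptions to verify against the actual V-JEPA 2 module names, and the rank/alpha values are only examples:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVideoClassification

model = AutoModelForVideoClassification.from_pretrained(
    "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
)

# Target-module names are an assumption; list the attention projections via
# [n for n, _ in model.named_modules()] and adjust accordingly.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],
    modules_to_save=["classifier"],  # keep the task head fully trainable (name assumed)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```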
#### Summary
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
[More Information Needed]
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]
## Technical Specifications
### Model Architecture and Objective
- ViT-Large backbone (307 M params) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4 096 → 51 classes); a sketch follows below.
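A stand-alone sketch of what such a head looks like; the 1 024-dim input (the standard ViT-Large width) and the layer layout are assumptions based on the description above, not the checkpoint's actual module names:

```python
import torch.nn as nn

# Hypothetical stand-in for the classification head described above:
# two MLP layers mapping pooled V-JEPA 2 features to 51 HMDB-51 logits.
head = nn.Sequential(
    nn.Linear(1024, 4096),  # 1024-dim ViT-Large embedding is an assumption
    nn.GELU(),
    nn.Linear(4096, 51),
)
```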
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```
**APA:**
Shelar, S. (2025). *V-JEPA2 ViT-L fine-tuned on HMDB-51* [Model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
Sujit Shelar
## Model Card Contact
[More Information Needed]