Update README.md
README.md CHANGED
@@ -18,12 +18,12 @@ tags: []
 V-JEPA 2 is a self-supervised video backbone trained on >1 M h of internet video; Meta released checkpoints with a Something-Something v2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive top-1 accuracy (see Evaluation).

 - **Developed by:** Sujit Shelar
-- **Funded by
-- **Shared by
+- **Funded by:** self-funded (personal compute credits)
+- **Shared by:** V-JEPA 2 ViT-Large (16 frames per clip, 256² input) video encoder with a 51-way classification head
 - **Model type:** Vision (video); no text inputs
 - **Language(s) (NLP):** [More Information Needed]
 - **License:** MIT – identical to the upstream V-JEPA 2 weights
-- **Finetuned from model
+- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2

 ### Model Sources [optional]
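The model description in the hunk above says the V-JEPA 2 backbone is frozen and only the classifier head is fine-tuned on HMDB-51 for 5 epochs. A minimal sketch of that setup, assuming the `AutoModelForVideoClassification` interface for the facebook/vjepa2-vitl-fpc16-256-ssv2 checkpoint; the head-parameter name filter, the optimiser settings, and the `hmdb51_loader` dataloader are placeholders, not the exact training code behind these weights.

```python
import torch
from transformers import AutoModelForVideoClassification

ckpt = "facebook/vjepa2-vitl-fpc16-256-ssv2"

# Load the SSv2-finetuned checkpoint and swap in a fresh 51-way head for HMDB-51.
model = AutoModelForVideoClassification.from_pretrained(
    ckpt,
    num_labels=51,
    ignore_mismatched_sizes=True,  # the upstream SSv2 head is 174-way, so it is re-initialised
)

# Freeze everything except the classification head. The "classifier"/"pooler"
# substring filter is an assumption about how the head parameters are named.
for name, param in model.named_parameters():
    param.requires_grad = ("classifier" in name) or ("pooler" in name)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3,  # placeholder, not necessarily the learning rate used here
)

model.train()
for epoch in range(5):              # 5 epochs, as stated above
    for batch in hmdb51_loader:     # hypothetical dataloader yielding processor outputs plus "labels"
        outputs = model(**batch)    # the head returns a cross-entropy loss when labels are supplied
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```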
@@ -115,7 +115,7 @@ HMDB-51 (CC BY-4.0, 6 766 clips across 51 classes). I stratify 70 / 15 / 15 % in
 | Hardware | 1× nvidia-a100-80gb |


 #### Preprocessing

 Clips are sampled at 16 frames per video (torchcodec.clips_at_random_indices), resized/cropped to 256², then normalised by the processor.
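The preprocessing note above names torchcodec's random clip sampler and the checkpoint's processor. A rough sketch of that input pipeline, assuming torchcodec's `VideoDecoder` / `clips_at_random_indices` and `AutoVideoProcessor`; the video path is a placeholder, and the exact resize/crop and normalisation constants are delegated to the processor config.

```python
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")

# Decode one HMDB-51 video and sample a single 16-frame clip at random indices.
decoder = VideoDecoder("HMDB51/brush_hair/some_clip.avi")  # placeholder path
clips = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)

# clips.data is (num_clips, frames, C, H, W); the processor resizes/crops to
# 256 x 256 and normalises with the checkpoint's mean/std.
inputs = processor(clips.data[0], return_tensors="pt")  # pixel values ready for the model
```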
@@ -220,7 +220,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 - **Compute Region:** [More Information Needed]
 - **Carbon Emitted:** [More Information Needed]

 ## Technical Specifications

 ### Model Architecture and Objective
@@ -242,7 +242,7 @@ Classification head: two MLP layers (hidden 4 096 → 51 classes).

 [More Information Needed]

 ## Citation

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
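The last hunk's header mentions the classification head: two MLP layers with a 4 096-unit hidden layer feeding 51 classes. As a rough illustration only, that corresponds to a module like the following; the 1 024-dim input (ViT-L feature width), the GELU activation, and the layer layout are assumptions, not the exact implementation.

```python
import torch.nn as nn

# Hypothetical stand-in for the 51-way head described above: pooled backbone
# features -> 4 096-unit hidden layer -> 51 HMDB-51 logits.
classifier_head = nn.Sequential(
    nn.Linear(1024, 4096),  # ViT-Large feature width assumed to be 1 024
    nn.GELU(),              # activation choice is an assumption
    nn.Linear(4096, 51),
)
```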