Update README.md
README.md CHANGED
@@ -18,12 +18,12 @@ tags: []
 V-JEPA 2 is a self-supervised video backbone trained on >1 M h of internet video; Meta released checkpoints with a Something-Something v2 action head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches competitive top-1 accuracy (see Evaluation).

 - **Developed by:** Sujit Shelar
-- **Funded by
-- **Shared by
+- **Funded by:** self-funded (personal compute credits)
+- **Shared by:** V-JEPA 2 ViT-Large (16 frames per clip, 256² input) video encoder with a 51-way classification head
 - **Model type:** Vision (video); no text inputs
 - **Language(s) (NLP):** [More Information Needed]
 - **License:** MIT – identical to the upstream V-JEPA 2 weights
-- **Finetuned from model
+- **Finetuned from model:** facebook/vjepa2-vitl-fpc16-256-ssv2

 ### Model Sources [optional]
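The model description in the hunk above says the V-JEPA 2 backbone is frozen and only the classifier head is fine-tuned on HMDB-51 for 5 epochs. A minimal sketch of that setup, assuming the `AutoModelForVideoClassification` interface for the facebook/vjepa2-vitl-fpc16-256-ssv2 checkpoint; the head-parameter name filter, the optimiser settings, and the `hmdb51_loader` dataloader are placeholders, not the exact training code behind these weights.

```python
import torch
from transformers import AutoModelForVideoClassification

ckpt = "facebook/vjepa2-vitl-fpc16-256-ssv2"

# Load the SSv2-finetuned checkpoint and swap in a fresh 51-way head for HMDB-51.
model = AutoModelForVideoClassification.from_pretrained(
    ckpt,
    num_labels=51,
    ignore_mismatched_sizes=True,  # the upstream SSv2 head is 174-way, so it is re-initialised
)

# Freeze everything except the classification head. The "classifier"/"pooler"
# substring filter is an assumption about how the head parameters are named.
for name, param in model.named_parameters():
    param.requires_grad = ("classifier" in name) or ("pooler" in name)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3,  # placeholder, not necessarily the learning rate used here
)

model.train()
for epoch in range(5):              # 5 epochs, as stated above
    for batch in hmdb51_loader:     # hypothetical dataloader yielding processor outputs plus "labels"
        outputs = model(**batch)    # the head returns a cross-entropy loss when labels are supplied
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```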
@@ -115,7 +115,7 @@ HMDB-51 (CC BY-4.0, 6 766 clips across 51 classes). I stratify 70 / 15 / 15 % in
 | Hardware | 1× nvidia-a100-80gb |


 #### Preprocessing

 Clips are sampled at 16 frames per video (torchcodec.clips_at_random_indices), resized/cropped to 256², then normalised by the processor.
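The preprocessing note above names torchcodec's random clip sampler and the checkpoint's processor. A rough sketch of that input pipeline, assuming torchcodec's `VideoDecoder` / `clips_at_random_indices` and `AutoVideoProcessor`; the video path is a placeholder, and the exact resize/crop and normalisation constants are delegated to the processor config.

```python
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")

# Decode one HMDB-51 video and sample a single 16-frame clip at random indices.
decoder = VideoDecoder("HMDB51/brush_hair/some_clip.avi")  # placeholder path
clips = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)

# clips.data is (num_clips, frames, C, H, W); the processor resizes/crops to
# 256 x 256 and normalises with the checkpoint's mean/std.
inputs = processor(clips.data[0], return_tensors="pt")  # pixel values ready for the model
```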
@@ -220,7 +220,7 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 - **Compute Region:** [More Information Needed]
 - **Carbon Emitted:** [More Information Needed]

 ## Technical Specifications

 ### Model Architecture and Objective
@@ -242,7 +242,7 @@ Classification head: two MLP layers (hidden 4 096 → 51 classes).

 [More Information Needed]

 ## Citation

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
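The last hunk's header mentions the classification head: two MLP layers with a 4 096-unit hidden layer feeding 51 classes. As a rough illustration only, that corresponds to a module like the following; the 1 024-dim input (ViT-L feature width), the GELU activation, and the layer layout are assumptions, not the exact implementation.

```python
import torch.nn as nn

# Hypothetical stand-in for the 51-way head described above: pooled backbone
# features -> 4 096-unit hidden layer -> 51 HMDB-51 logits.
classifier_head = nn.Sequential(
    nn.Linear(1024, 4096),  # ViT-Large feature width assumed to be 1 024
    nn.GELU(),              # activation choice is an assumption
    nn.Linear(4096, 51),
)
```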