---
license: mit
pipeline_tag: video-classification
---
## Introduction
This repository contains the stage-2 6B model from the paper [InternVideo2](https://arxiv.org/pdf/2403.15377).
Code: https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/multi_modality
## 🚀 Installation
Please refer to https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/INSTALL.md
## Usage
```python
import cv2
from transformers import AutoModel
from modeling_internvideo2 import retrieve_text, vid2tensor, _frame_from_video

if __name__ == '__main__':
    # trust_remote_code=True lets transformers run the repository's custom model code.
    model = AutoModel.from_pretrained("OpenGVLab/InternVideo2-Stage2_6B", trust_remote_code=True).eval()

    # Read the frames of the example video.
    video = cv2.VideoCapture('example1.mp4')
    frames = [x for x in _frame_from_video(video)]

    text_candidates = [
        "A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
        "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
        "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
        "A cat excitedly runs through the yard, chasing a rabbit.",
        "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery.",
    ]

    # Rank the candidate captions against the video and print each with its probability.
    texts, probs = retrieve_text(frames, text_candidates, model=model, topk=5)
    for t, p in zip(texts, probs):
        print(f'text: {t} ~ prob: {p:.4f}')

    # Convert the video file to a tensor (fnum=4) and extract its video feature.
    vidtensor = vid2tensor('example1.mp4', fnum=4)
    feat = model.get_vid_feat(vidtensor)
```
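
To give some intuition for the `texts, probs` pair returned by `retrieve_text`, here is a minimal sketch of how video-to-text retrieval is commonly scored: cosine similarity between one video embedding and each text embedding, followed by a softmax and a top-k selection. This is an illustration only and not the repository's implementation. The function names `l2_normalize` and `rank_candidates`, the `temperature` value, and the toy 3-dimensional vectors are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def rank_candidates(video_feat, text_feats, texts, topk=5, temperature=0.01):
    """Hypothetical helper: rank texts by softmax over cosine similarities.

    `temperature` is an assumed value, not one taken from InternVideo2.
    """
    v = l2_normalize(video_feat)
    # Cosine similarity is the dot product of the unit-length vectors.
    sims = [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_feats]
    logits = [s / temperature for s in sims]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the indices of the topk highest-probability candidates.
    order = sorted(range(len(texts)), key=lambda i: probs[i], reverse=True)[:topk]
    return [texts[i] for i in order], [probs[i] for i in order]


if __name__ == "__main__":
    # Toy embeddings standing in for real model features.
    toy_video = [1.0, 0.0, 0.0]
    toy_texts = ["dog in snow", "cat in yard", "man with sleigh"]
    toy_feats = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
    for t, p in zip(*rank_candidates(toy_video, toy_feats, toy_texts, topk=3)):
        print(f"text: {t} ~ prob: {p:.4f}")
```

The probabilities are relative to the candidate list: adding or removing a candidate changes every score, so they are best read as a ranking among the supplied texts.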