JenJSun committed (verified)
Commit 305bd08 · 1 parent: 15a6bfb

Update README.md

Files changed (1): README.md (+57 -7)

README.md (content after this commit; unchanged lines between diff hunks are omitted and marked with […]):

---

# VideoPrism Model Card

**Paper**: https://huggingface.co/papers/2402.13217

**arXiv**: https://arxiv.org/pdf/2402.13217

[…]

## Model details

We release the following model variants:

| Model Name | Configuration Name | Model Type | Backbone | #Params | File Size | Checkpoint |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: |
| VideoPrism-B | `videoprism_public_v1_base` | Video encoder | ViT-B | 114M | 458MB | [link](https://huggingface.co/google/videoprism-base-f16r288) |
| VideoPrism-L | `videoprism_public_v1_large` | Video encoder | ViT-L | 354M | 1.42GB | [link](https://huggingface.co/google/videoprism-large-f8r288) |
| VideoPrism-LvT-B | `videoprism_lvt_public_v1_base` | Video-text encoders | ViT-B | 248M | 991MB | [link](https://huggingface.co/google/videoprism-lvt-base-f16r288) |
| VideoPrism-LvT-L | `videoprism_lvt_public_v1_large` | Video-text encoders | ViT-L | 580M | 2.30GB | [link](https://huggingface.co/google/videoprism-lvt-large-f8r288) |
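
If you want the checkpoint files locally, one option is the `huggingface_hub` client (an assumption for this example; the repositories linked above can also be used directly by the loaders in the GitHub repository):

```python
# Optional: download a released checkpoint locally (repo id taken from the table above).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="google/videoprism-base-f16r288")
print(local_dir)  # path to the downloaded checkpoint files
```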

### Model description

VideoPrism-B/L each comprise a Vision Transformer image encoder followed by four temporal-attention Transformer layers. The image encoder and text encoder are initialized from [CoCa](https://arxiv.org/abs/2205.01917) trained on WebLI following the CoCa recipe. VideoPrism is based on the [ViViT](https://arxiv.org/abs/2103.15691) factorized video encoder architecture.

### Inputs and outputs

The models take videos with shape (num_frames, 288, 288, 3) as input and produce embeddings with shape (num_frames * 16 * 16, feature_channels), which can be reshaped into (num_frames, 16, 16, feature_channels) for spatiotemporal representations. During model training, num_frames is set to 16 for VideoPrism-B and 8 for VideoPrism-L. Both models are expected to work with arbitrary num_frames by interpolating the temporal positional embeddings.
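
For illustration, the minimal JAX sketch below (using placeholder data rather than a real checkpoint, with `num_frames` and `feature_channels` values assumed for VideoPrism-B) shows how the flat token sequence maps onto the spatiotemporal grid described above:

```python
import jax.numpy as jnp

num_frames, feature_channels = 16, 768  # assumed values for illustration (VideoPrism-B)

# Placeholder for the encoder output: one embedding per spatiotemporal token.
tokens = jnp.zeros((num_frames * 16 * 16, feature_channels))

# Recover the per-frame 16x16 token grid for spatiotemporal representations.
spatiotemporal = tokens.reshape(num_frames, 16, 16, feature_channels)

# One simple (illustrative, not prescribed) way to pool into a single video-level vector.
video_embedding = spatiotemporal.mean(axis=(0, 1, 2))  # shape: (feature_channels,)
```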

In the video-text models, both the video and text encoders produce global embeddings with shape `(feature_channels,)`, whose similarity can be measured with cosine distance. We use the `c4_en` [SentencePiece](https://github.com/google/sentencepiece) model for text tokenization. During inference, the embedding computation for either modality can be skipped by passing `None` as that input.
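
To make the similarity computation concrete, here is a minimal sketch that scores a batch of video embeddings against a batch of text embeddings with cosine similarity; the random arrays are stand-ins for the global embeddings the encoders produce:

```python
import jax
import jax.numpy as jnp

def cosine_similarity(a, b, eps=1e-8):
  """Pairwise cosine similarity between two sets of embeddings."""
  a = a / (jnp.linalg.norm(a, axis=-1, keepdims=True) + eps)
  b = b / (jnp.linalg.norm(b, axis=-1, keepdims=True) + eps)
  return a @ b.T

feature_channels = 768  # assumed for illustration
key_v, key_t = jax.random.split(jax.random.PRNGKey(0))
video_embeddings = jax.random.normal(key_v, (4, feature_channels))  # 4 videos
text_embeddings = jax.random.normal(key_t, (3, feature_channels))   # 3 text queries

scores = cosine_similarity(video_embeddings, text_embeddings)  # (4, 3) score matrix
best_video_per_text = scores.argmax(axis=0)  # text-to-video retrieval
best_text_per_video = scores.argmax(axis=1)  # video-to-text retrieval
```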

## Uses

VideoPrism has a wide range of applications across video understanding scenarios. The following list covers some primary use cases but is not comprehensive; its purpose is to give context on the applications the model creators considered during model training and development.
* **Video classification**: By feeding the video embeddings to a lightweight classifier (see the sketch after this list), we can tackle video action recognition, a fundamental task in video understanding, under various scenarios.
* **Temporal and spatiotemporal localization**: We can also use the model to localize actions of interest spatially and temporally by equipping it with a bounding box proposal module.
* **Video retrieval and open-set classification**: By pairing the video embeddings with a text encoder in the CLIP fashion, we can perform text-video retrieval and open-set video classification.
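
The sketch below illustrates the lightweight-classifier idea from the first bullet with a simple Flax linear probe over frozen video embeddings; the embedding dimension, pooling, and number of classes are assumptions for the example, not part of the released models:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class LinearProbe(nn.Module):
  """Lightweight classifier over frozen, mean-pooled VideoPrism embeddings."""
  num_classes: int

  @nn.compact
  def __call__(self, token_embeddings):  # (batch, num_tokens, feature_channels)
    pooled = token_embeddings.mean(axis=1)     # simple average pooling (illustrative)
    return nn.Dense(self.num_classes)(pooled)  # class logits

feature_channels, num_classes = 768, 400       # assumed values (e.g., Kinetics-400 classes)
probe = LinearProbe(num_classes=num_classes)

# Placeholder frozen embeddings: 2 videos, 16 frames x 16 x 16 tokens each.
embeddings = jnp.zeros((2, 16 * 16 * 16, feature_channels))
params = probe.init(jax.random.PRNGKey(0), embeddings)
logits = probe.apply(params, embeddings)       # shape: (2, num_classes)
```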

## Ethical considerations and risks

The model inherits the safety benefits and safety risks associated with the CoCa image encoder and the training datasets described below. We recommend that the model not be used for downstream applications without prior assessment and mitigation of application-specific security and fairness concerns.
* Data bias: Large datasets scraped from the internet can contain inherent biases, leading to skewed model performance and potentially discriminatory outputs. The presence of "noisy parallel text" such as ASR transcripts introduces potential inaccuracies and biases from the speech-to-text process.
* Content moderation: The sheer volume of data (36M video-caption pairs and 582M video clips) raises concerns about the presence of objectionable or inappropriate content within the training data, which could lead to harmful model outputs.
* Ethical use: As with any powerful video understanding model, there are risks of misuse, such as in surveillance or the propagation of misinformation.
* Limitations: The reliance on potentially noisy text data can limit the model's understanding of the true video content. Further research is needed to refine the model's ability to understand long-form videos, geometric information in videos, and non-semantic cues.

## How to get started with the model

To get started with our models, please see the code and examples in our [GitHub repository](https://github.com/google-deepmind/videoprism).
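
As a rough orientation, the sketch below shows what inference can look like with the released checkpoints. It follows the example usage in the GitHub repository, but the import path and helper names (`vp.MODELS`, `vp.load_pretrained_weights`) are assumptions that may not match the current API, so treat it as illustrative and defer to the repository's README.

```python
# Illustrative only: names below follow the repository's example usage and may have changed.
import jax
import jax.numpy as jnp
from videoprism import models as vp  # assumed import path

model_name = 'videoprism_public_v1_base'               # configuration name from the table above
flax_model = vp.MODELS[model_name]()                   # assumed model registry
loaded_state = vp.load_pretrained_weights(model_name)  # assumed weight-loading helper

@jax.jit
def forward(video):
  # Video encoder forward pass; the exact return structure is an assumption.
  return flax_model.apply(loaded_state, video, train=False)

# 1 clip of 16 RGB frames at 288x288; pixel range/normalization per the repository docs.
video = jnp.zeros((1, 16, 288, 288, 3))
embeddings = forward(video)  # per-token embeddings, roughly (1, 16 * 16 * 16, feature_channels)
```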

### Feedback and questions

We welcome all questions and feedback! If you find a bug, have a feature request, or want to ask a question, please don't hesitate to **open an issue** on our GitHub repository.

We're excited to see what you build with VideoPrism! 🚀

## Training details

### Training data

VideoPrism is pre-trained on a wide range of videos (36M video-caption pairs and 582M video clips), including the datasets below. Note that the numbers of clips are subject to change due to data wipeouts in accordance with policy.

| Pretraining datasets | Public | Domain | Caption source | Caption quality | # of videos | # of clips |
[…]

## Evaluation

In the tables below, "Public" denotes the models we release in this repository; "Paper" and "Prior SOTA" denote our models and the previous best-performing models reported in the paper, respectively. Our public models perform slightly worse than the paper models because we used different pre-training image-text data, subject to data policy.

### Results on video-focused tasks with frozen backbones

| Models | K400 | MiT | SSv2 | D48 | Charades | ActivityNet | AVA | AVA-K |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **VideoPrism-B (public)** | 82.9 | 39.7 | 62.2 | 64.3 | 43.5 | 36.5 | 28.3 | 30.8 |
| **VideoPrism-L (public)** | 85.0 | 43.3 | 64.6 | 67.6 | 53.2 | 37.0 | 32.4 | 34.5 |
| VideoPrism-B (paper) | 84.2 | 40.8 | 63.6 | 67.4 | 40.4 | 36.6 | 30.6 | 31.8 |
| VideoPrism-g (paper) | 87.2 | 45.5 | 68.5 | 71.3 | 62.3 | 37.8 | 36.2 | 37.3 |
| Prior SOTA (B) | 77.1 | 34.0 | 58.2 | 55.6 | 33.3 | 35.8 | 21.1 | 25.9 |
| Prior SOTA (L+) | 82.8 | 40.3 | 67.4 | 69.6 | 39.9 | 36.7 | 24.4 | 26.2 |

### Zero-shot video-text retrieval

| Models | MSRVTT-1K (v2t) | MSRVTT-1K (t2v) | VATEX (v2t) | VATEX (t2v) | ActivityNet (v2t) | ActivityNet (t2v) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **VideoPrism-LvT-B (public)** | 49.8 | 50.1 | 73.1 | 56.2 | 47.9 | 48.8 |
| **VideoPrism-LvT-L (public)** | 50.6 | 50.1 | 75.0 | 57.2 | 49.1 | 51.3 |
| VideoPrism-LvT-B (paper) | 50.2 | 51.4 | 76.2 | 57.7 | 47.9 | 49.6 |
| VideoPrism-LvT-g (paper) | 51.7 | 52.7 | 77.1 | 62.5 | 50.3 | 52.7 |
| Prior SOTA (B) | - | 34.0 | - | - | - | 30.6 |
| Prior SOTA (L+) | 45.4 | 43.9 | 73.6 | 53.2 | 40.7 | 42.8 |

### Zero-shot video classification

| Models | K400 | SSv2 (Temporal) | SSv2 (Events) | NExT-QA (Hard) | Charades | Charades (STA) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **VideoPrism-LvT-B (public)** | 69.2 | 14.6 | 11.3 | 31.1 | 26.9 | 48.6 |
| **VideoPrism-LvT-L (public)** | 72.4 | 18.0 | 12.4 | 32.1 | 32.4 | 50.2 |
| VideoPrism-LvT-B (paper) | 71.3 | 16.1 | 11.9 | 31.3 | 29.2 | 50.0 |
| VideoPrism-LvT-g (paper) | 74.6 | 18.6 | 15.7 | 32.7 | 32.4 | 50.4 |
| Prior SOTA (B) | - | 9.8 | 6.4 | 27.6 | 21.1 | - |
| Prior SOTA (L+) | 72.0 | 15.2 | 11.4 | 25.2 | 25.8 | 47.2 |

## Implementation information

### Model architecture

The vision model is a [ViViT](https://arxiv.org/abs/2103.15691) factorized video encoder: a Vision Transformer image encoder initialized from [CoCa](https://arxiv.org/abs/2205.01917), followed by four temporal-attention Transformer layers.
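
As a schematic of this factorized design (not the released implementation), the Flax sketch below shows how a temporal-attention layer can attend across frames independently at each spatial location of the image encoder's token grid; layer sizes are illustrative:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class TemporalAttentionBlock(nn.Module):
  """Schematic temporal-attention layer: frames attend to each other at every spatial location."""
  num_heads: int = 12  # illustrative

  @nn.compact
  def __call__(self, x):  # x: (batch, num_frames, 16 * 16, channels) spatial tokens per frame
    b, t, s, c = x.shape
    # Fold the spatial axis into the batch so attention runs over the time axis only.
    y = x.transpose(0, 2, 1, 3).reshape(b * s, t, c)
    y = nn.SelfAttention(num_heads=self.num_heads)(nn.LayerNorm()(y))
    y = y.reshape(b, s, t, c).transpose(0, 2, 1, 3)
    return x + y  # residual connection

block = TemporalAttentionBlock()
tokens = jnp.zeros((1, 16, 16 * 16, 768))
params = block.init(jax.random.PRNGKey(0), tokens)
out = block.apply(params, tokens)  # same shape as the input
```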

### Hardware

VideoPrism was trained using [Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu) hardware.

### Software

JAX, Flax

## Citation

VideoPrism:
```
@inproceedings{zhao2024videoprism,
[…]
```

VideoGLUE benchmarks:
```
[…]
  year = {2024}
}
```