aerickso committed
Commit a50d69c · verified · 1 Parent(s): cdb7ec7

Update README.md

Files changed (1):
  1. README.md +214 -8
README.md CHANGED
@@ -1,23 +1,229 @@
  ---
- {}
  ---
- # Cliptagger
- #### A targeted keyframe annotation model.

- This model is a specific development built off of Gemma-3 to provide a disentangled and precise description of video keyframes. The goal is to make descriptive output fields for downstream tasks, including classification, clustering, and by extension search.

- The ouput fields, after being provided a keyframe image, will look something like the below.

  ```
  {
- "description": "A detailed, factual account of what is visibly happening in the image (4 sentences max). Only mention concrete elements or actions that are clearly shown. Do not include anything about how the image is styled, shot, or composed.",
  "objects": ["object1 with relevant visual details", "object2 with relevant visual details", ...],
  "actions": ["action1 with participants and context", "action2 with participants and context", ...],
  "environment": "Detailed factual description of the setting and atmosphere based on visible cues (e.g., interior of a classroom with fluorescent lighting, or outdoor forest path with snow-covered trees).",
  "content_type": "The type of content it is, e.g. 'real-world footage', 'video game', 'animation', 'cartoon', 'CGI', 'VTuber', etc.",
  "specific_style": "Specific genre, aesthetic, or platform style (e.g., anime, 3D animation, mobile gameplay, vlog, tutorial, news broadcast, etc.)",
  "production_quality": "Visible production level: e.g., 'professional studio', 'amateur handheld', 'webcam recording', 'TV broadcast', etc.",
- "summary": "One clear, comprehensive sentence summarizing the visual content of the frame.",
  "logos": ["logo1 with visual description", "logo2 with visual description", ...]
  }
- ```

  ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - VLM
+ - video-understanding
+ - image-captioning
+ - gemma
+ - json-mode
+ - structured-output
+ - video-analysis
+ base_model: google/gemma-12b
+ pipeline_tag: image-text-to-text
+ model-index:
+ - name: ClipTagger-12b
+   results:
+   - task:
+       type: image-to-text
+       name: Video Frame Captioning
+     metrics:
+     - name: Average Judge Score
+       type: quality
+       value: 3.53
+     - name: ROUGE-1
+       type: rouge-1
+       value: 0.674
+     - name: ROUGE-L
+       type: rouge-l
+       value: 0.520
+     - name: BLEU
+       type: bleu
+       value: 0.267
  ---

+ # ClipTagger-12b

+ ![ClipTagger-12b](./assets/grass-x-inference.png)

+ ## Model Description
+
+ **ClipTagger-12b** is a 12-billion parameter vision-language model (VLM) designed for video understanding at massive scale. Developed by [Inference.net](https://inference.net) in collaboration with [Grass](https://grass.io), this model was created to meet the demanding requirements of trillion-scale video frame captioning workloads.
+
+ **ClipTagger-12b matches the captioning quality of GPT-4.1 and exceeds Claude 4 Sonnet, while costing roughly 15x less per generation.**
+
+ The model generates structured, schema-consistent JSON outputs for every video frame, making it ideal for building searchable video databases, content moderation systems, and accessibility tools. It maintains temporal consistency across frames while delivering frontier-quality performance at a fraction of the cost of closed-source alternatives.
+
+ ### Key Features
+
+ - **Frontier-quality performance** - Comparable to top closed models in captioning quality
+ - **Production-ready** - Battle-tested on trillion-scale video frame captioning workloads
+ - **Schema-consistent JSON** - Reliable structured output for every frame
+ - **Cost-efficient** - Optimized for high-throughput inference
+ - **Open source** - Build and deploy without proprietary API dependencies
+
+ ## Architecture
+
+ ClipTagger-12b is based on the Gemma-12B architecture and has been optimized with FP8 quantization for maximum throughput on modern GPUs. The model is specifically tuned for RTX 40-series and H100 GPUs, leveraging native FP8 support for efficient inference.
+
+ ### Technical Specifications
+ - **Parameters**: 12 billion
+ - **Base Architecture**: Gemma-12B
+ - **Quantization**: FP8 (no quality loss vs bf16)
+ - **Input**: Single video frame per request
+ - **Output**: Structured JSON with fixed schema
+ - **Supported Formats**: JPEG, PNG, WebP, GIF
+ - **Max Image Size**: 1MB
+
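As a minimal, illustrative sketch of how a client could enforce these input limits before sending a frame (the helper name and the data-URL convention are assumptions, not part of this card):

```python
import base64
from pathlib import Path

# Enforce the documented input limits: JPEG/PNG/WebP/GIF, at most 1MB per frame.
MAX_BYTES = 1_000_000
SUFFIX_TO_MIME = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".png": "image/png", ".webp": "image/webp", ".gif": "image/gif",
}

def frame_to_data_url(path: str) -> str:
    """Validate a keyframe file against the spec above and return it as a base64 data URL."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUFFIX_TO_MIME:
        raise ValueError(f"Unsupported format: {suffix}; use JPEG, PNG, WebP, or GIF")
    data = Path(path).read_bytes()
    if len(data) > MAX_BYTES:
        raise ValueError(f"Image is {len(data)} bytes; the documented limit is {MAX_BYTES}")
    return f"data:{SUFFIX_TO_MIME[suffix]};base64,{base64.b64encode(data).decode()}"

# Example: data_url = frame_to_data_url("keyframe_0001.jpg")
```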
+ ## Training
+
+ The model was trained on 1 million carefully curated single-frame samples from publicly available video data. Training employed knowledge distillation from a high-quality teacher model to ensure consistent, accurate outputs while maintaining the ability to generalize across diverse video content types.
+
+ ### Training Process
+ - **Dataset Size**: 1M video frames
+ - **Training Method**: Teacher-student distillation
+ - **Data Source**: Publicly available video content
+ - **Focus**: Single-frame understanding with temporal awareness
+
+ ## Benchmarks
+
+ ClipTagger-12b matches or exceeds the leading closed-source models on our evaluation set. It **outperforms Claude 4 Sonnet on every metric** and delivers **quality comparable to GPT-4.1**, with higher ROUGE and BLEU scores and a slightly lower judge score, despite being open-source and significantly more cost-effective.
+
+ Performance metrics on our internal evaluation set:
+
+ | Model | Avg Judge Score | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU |
+ |-------|-----------------|---------|---------|---------|------|
+ | cliptagger_12b | **3.53** | **0.674** | **0.404** | **0.520** | **0.267** |
+ | claude_4_sonnet | 3.16 | 0.463 | 0.179 | 0.281 | 0.060 |
+ | gpt_4.1 | 3.64 | 0.581 | 0.260 | 0.376 | 0.119 |
+
+ We used Gemini-2.5-Pro as the judge model, which ranks ClipTagger-12b roughly equal to GPT-4.1, and better than Claude 4 Sonnet.
+
+ <img src="./assets/judge-score.png" alt="Average Judge Score Comparison" width="100%" />
+
+ FP8 quantization showed no measurable quality degradation compared to bf16 precision.
+
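The card does not state which tooling produced the ROUGE and BLEU numbers; as a rough illustration of how such caption-level scores can be computed, here is a sketch using the third-party `rouge-score` and `sacrebleu` packages (the packages, function, and example strings are assumptions, not the team's evaluation pipeline):

```python
# pip install rouge-score sacrebleu
from rouge_score import rouge_scorer
import sacrebleu

def caption_metrics(reference: str, candidate: str) -> dict:
    """Compute ROUGE-1/2/L F1 and BLEU between a reference caption and a model caption."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    bleu = sacrebleu.sentence_bleu(candidate, [reference])
    return {
        "rouge-1": rouge["rouge1"].fmeasure,
        "rouge-2": rouge["rouge2"].fmeasure,
        "rouge-l": rouge["rougeL"].fmeasure,
        "bleu": bleu.score / 100.0,  # sacrebleu reports 0-100; the table appears to use a 0-1 scale
    }

print(caption_metrics(
    "A wooden boardwalk path winds through a lush green field under a blue sky.",
    "A wooden boardwalk extends through tall green grass beneath a blue sky with clouds.",
))
```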
+ ## Cost Comparison
+
+ ClipTagger-12b delivers frontier-quality performance at a fraction of the cost of closed-source alternatives. Based on typical usage patterns (700 input tokens and 250 output tokens per generation), here's how the costs compare:
+
+ <img src="./assets/cost.png" alt="Cost Comparison Per 1 Million Generations" width="100%" />
+
+ ClipTagger-12b offers **15x cost savings** compared to GPT-4.1 and **17x cost savings** compared to Claude 4 Sonnet, while maintaining comparable quality metrics.
+
+ | Model | Input Cost/MTok | Output Cost/MTok | Cost per 1M Generations | Cost per Generation |
+ | --------------- | --------------- | ---------------- | ----------------------- | ------------------- |
+ | ClipTagger-12b | $0.30 | $0.50 | $335 | $0.000335 |
+ | GPT-4.1 | $3.00 | $12.00 | $5,100 | $0.0051 |
+ | Claude 4 Sonnet | $3.00 | $15.00 | $5,850 | $0.00585 |
+
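The per-generation figures in the table follow directly from the listed per-token prices and the assumed 700 input / 250 output tokens per call; a quick arithmetic check:

```python
# Reproduce the "Cost per Generation" column from the table above.
PRICES = {  # USD per million tokens (input, output), as listed in the table
    "ClipTagger-12b": (0.30, 0.50),
    "GPT-4.1": (3.00, 12.00),
    "Claude 4 Sonnet": (3.00, 15.00),
}
INPUT_TOKENS, OUTPUT_TOKENS = 700, 250  # typical usage assumed by the card

for model, (in_price, out_price) in PRICES.items():
    per_gen = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000
    print(f"{model}: ${per_gen:.6f} per generation, ${per_gen * 1_000_000:,.0f} per 1M generations")
# ClipTagger-12b: $0.000335 -> $335; GPT-4.1: $0.005100 -> $5,100; Claude 4 Sonnet: $0.005850 -> $5,850
```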
+ ## Usage
+
+ ### API Access
+
+ For production deployments, we recommend using our managed API service which includes advanced features like batch processing, webhooks, and automatic scaling:
+
+ **[Run ClipTagger-12b via Inference.net API →](https://docs.inference.net/use-cases/video-understanding)**
+
+ ### Required Prompts
+
+ The model requires specific system and user prompts for optimal performance. Use these prompts exactly as shown:
+
+ #### System Prompt
  ```
+ You are an image annotation API trained to analyze YouTube video keyframes. You will be given instructions on the output format, what to caption, and how to perform your job. Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with 'This image shows' or 'This keyframe displays...', just get right into the details.
+ ```
+
+ #### User Prompt
+ ```
+ You are an image annotation API trained to analyze YouTube video keyframes. You must respond with a valid JSON object matching the exact structure below.
+
+ Your job is to extract detailed **factual elements directly visible** in the image. Do not speculate or interpret artistic intent, camera focus, or composition. Do not include phrases like "this appears to be", "this looks like", or anything about the image itself. Describe what **is physically present in the frame**, and nothing more.
+
+ Return JSON in this structure:
+
  {
+ "description": "A detailed, factual account of what is visibly happening (4 sentences max). Only mention concrete elements or actions that are clearly shown. Do not include anything about how the image is styled, shot, or composed. Do not lead the description with something like 'This image shows' or 'this keyframe is...', just get right into the details.",
  "objects": ["object1 with relevant visual details", "object2 with relevant visual details", ...],
  "actions": ["action1 with participants and context", "action2 with participants and context", ...],
  "environment": "Detailed factual description of the setting and atmosphere based on visible cues (e.g., interior of a classroom with fluorescent lighting, or outdoor forest path with snow-covered trees).",
  "content_type": "The type of content it is, e.g. 'real-world footage', 'video game', 'animation', 'cartoon', 'CGI', 'VTuber', etc.",
  "specific_style": "Specific genre, aesthetic, or platform style (e.g., anime, 3D animation, mobile gameplay, vlog, tutorial, news broadcast, etc.)",
  "production_quality": "Visible production level: e.g., 'professional studio', 'amateur handheld', 'webcam recording', 'TV broadcast', etc.",
+ "summary": "One clear, comprehensive sentence summarizing the visual content of the frame. Like the description, get right to the point.",
  "logos": ["logo1 with visual description", "logo2 with visual description", ...]
  }
+
+ Rules:
+ - Be specific and literal. Focus on what is explicitly visible.
+ - Do NOT include interpretations of emotion, mood, or narrative unless it's visually explicit.
+ - No artistic or cinematic analysis.
+ - Always include the language of any text in the image if present as an object, e.g. "English text", "Japanese text", "Russian text", etc.
+ - Maximum 10 objects and 5 actions.
+ - Return an empty array for 'logos' if none are present.
+ - Always output strictly valid JSON with proper escaping.
+ - Output **only the JSON**, no extra text or explanation.
+ ```
+
+ ### Inference Parameters
+
+ - **Temperature**: 0.1 (recommended for consistency)
+ - **Max Tokens**: 2000
+ - **Response Format**: `{"type": "json_object"}`
+
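Putting the prompts and parameters above together, a request through an OpenAI-compatible chat-completions client might look like the following sketch; the base URL, model identifier, file name, and placeholder prompt strings are illustrative assumptions rather than values taken from this card (see the linked docs for the real endpoint):

```python
import base64
import json
from openai import OpenAI  # third-party client, assuming an OpenAI-compatible endpoint

client = OpenAI(
    base_url="https://api.inference.net/v1",  # hypothetical endpoint; check docs.inference.net
    api_key="YOUR_API_KEY",
)

SYSTEM_PROMPT = "..."  # the exact System Prompt shown above
USER_PROMPT = "..."    # the exact User Prompt shown above

# Encode a single keyframe (JPEG/PNG/WebP/GIF, at most 1MB) as a data URL.
with open("keyframe_0001.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cliptagger-12b",  # placeholder model identifier
    temperature=0.1,
    max_tokens=2000,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": USER_PROMPT},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]},
    ],
)

annotation = json.loads(response.choices[0].message.content)
print(annotation["summary"])
```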
+ ### Output Schema
+
+ The model outputs a fixed JSON structure with the following fields:
+
+ ```json
+ {
+ "description": "string - Detailed factual description (max 4 sentences)",
+ "objects": ["array of strings - Up to 10 objects with visual details"],
+ "actions": ["array of strings - Up to 5 actions with context"],
+ "environment": "string - Setting and atmosphere description",
+ "content_type": "string - Type of visual content",
+ "specific_style": "string - Genre or style classification",
+ "production_quality": "string - Production level assessment",
+ "summary": "string - Single sentence summary",
+ "logos": ["array of strings - Detected logos with descriptions"]
+ }
+ ```
+
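Because the schema is fixed, downstream code can validate each response before storing or indexing it. Below is a minimal stdlib-only validator sketched against the field types and limits described above; the function name and error handling are illustrative:

```python
import json

REQUIRED_STRING_FIELDS = ("description", "environment", "content_type",
                          "specific_style", "production_quality", "summary")
REQUIRED_LIST_FIELDS = {"objects": 10, "actions": 5, "logos": None}  # max lengths; logos has no stated cap

def validate_annotation(raw: str) -> dict:
    """Parse a model response and check it against the fixed output schema."""
    ann = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    for field in REQUIRED_STRING_FIELDS:
        if not isinstance(ann.get(field), str):
            raise ValueError(f"Missing or non-string field: {field}")
    for field, max_len in REQUIRED_LIST_FIELDS.items():
        value = ann.get(field)
        if not isinstance(value, list) or not all(isinstance(x, str) for x in value):
            raise ValueError(f"Field {field} must be a list of strings")
        if max_len is not None and len(value) > max_len:
            raise ValueError(f"Field {field} has {len(value)} items; max is {max_len}")
    return ann
```

In the request sketch above, this would wrap `response.choices[0].message.content` before the annotation is written to a database or search index.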
+ ## Example Output
+
+ Given a nature scene with a wooden boardwalk through grassland:
+
+ ```json
+ {
+ "description": "A wooden boardwalk path extends from the foreground into the distance, cutting through a field of tall, vibrant green grass. The path is flanked on both sides by the dense grass. In the background, a line of trees is visible on the horizon under a blue sky with scattered white clouds.",
+ "objects": [
+ "Wooden boardwalk",
+ "Tall green grass",
+ "Blue sky",
+ "White clouds",
+ "Trees"
+ ],
+ "actions": [],
+ "environment": "An outdoor, natural landscape, likely a marsh or wetland, on a clear day. The scene is characterized by a wooden boardwalk, lush green vegetation, and a bright blue sky with wispy clouds.",
+ "content_type": "real-world footage",
+ "specific_style": "landscape photography",
+ "production_quality": "professional photography",
+ "summary": "A wooden boardwalk path winds through a lush green field under a bright blue sky with scattered clouds.",
+ "logos": []
+ }
+ ```
+
+ ## Use Cases
+
+ - **Video Search & Discovery** - Build searchable databases with structured metadata
+ - **Content Moderation** - Automated content analysis and categorization
+ - **Accessibility** - Generate consistent alt-text and scene descriptions
+ - **Ad Verification** - Track product visibility and brand appearances
+ - **Video Analytics** - Extract insights from large video collections
+ - **Content Management** - Automatic tagging and organization of video libraries
+
+ ## Interested in training your own model?
+
+ Contact us at [[email protected]](mailto:[email protected]) for a free consultation with our research team.
+
+ ## Support
+
+ - **Documentation**: [docs.inference.net](https://docs.inference.net/use-cases/video-understanding)
+ - **API Access**: Get $25 in free credits when you [sign up](https://inference.net/register) for an account
+ - **Email**: [email protected]
+
+ ## License
+
+ This model is released under the Apache-2.0 license, allowing for commercial use and modification with proper attribution.