merve HF staff committed
Commit 624ec44 · verified · 1 Parent(s): d52b9cd

Add reserved link, fix typos, link datasets

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -30,11 +30,11 @@ SmolVLM2-500M-Video is a tiny video model, member of the SmolVLM family. It acce
 ## Resources
 
 - **Demo:** [Video Highlight Generator](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator)
-- **Blog:** [Blog post](TODO)
+- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm2)
 
 ## Uses
 
-SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input comprises text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
+SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input consists of text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
 
 To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](UPDATE).
 
@@ -53,13 +53,11 @@ We evaluated the performance of the SmolVLM2 family on the following scientific
 
 You can use transformers to load, infer and fine-tune SmolVLM.
 
-
+[TODO]
 
 
 ### Model optimizations
 
-
-
 ## Misuse and Out-of-scope Use
 
 SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
@@ -82,11 +80,11 @@ We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
-SmolVLM2 used 3.3M samples for training originally from ten different datasets: : LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat and ShareGPT4Video.
+SmolVLM2 used 3.3M samples for training originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
 In the following plots we give a general overview of the samples across modalities and the source of those samples.
 
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
-### Detailed videw
+### Details
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
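
The transformers code example is still marked `[TODO]` in this revision of the README. As a hedged sketch of the usage the card describes (loading the model with transformers and running video inference), it could look roughly like the snippet below; the checkpoint id `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`, the bfloat16/CUDA settings, the local video path, and a recent transformers release with SmolVLM2 video support are all assumptions, not content from this commit.

```python
# Hypothetical sketch, not part of this commit. Assumes the
# HuggingFaceTB/SmolVLM2-500M-Video-Instruct checkpoint and a recent
# transformers release that supports SmolVLM2 with video inputs.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to("cuda")  # or "cpu"; the 500M model is small enough for modest hardware

# One user turn interleaving a video with a text query.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},  # local path, assumed to exist
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

# The processor's chat template handles frame sampling and tokenization.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Since text and media entries can be interleaved arbitrarily in the `content` list, the same call pattern should also cover image tasks such as captioning or visual question answering: swap the video entry for one or more `{"type": "image", "url": ...}` entries mixed with text entries.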