merve HF staff committed
Commit 624ec44 · verified · 1 Parent(s): d52b9cd

Add reserved link, fix typos, link datasets

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -30,11 +30,11 @@ SmolVLM2-500M-Video is a tiny video model, member of the SmolVLM family. It acce
 ## Resources
 
 - **Demo:** [Video Highlight Generator](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator)
-- **Blog:** [Blog post](TODO)
+- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm2)
 
 ## Uses
 
-SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input comprises text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
+SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input consists of text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
 
 To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](UPDATE).
 
@@ -53,13 +53,11 @@ We evaluated the performance of the SmolVLM2 family on the following scientific
 
 You can use transformers to load, infer and fine-tune SmolVLM.
 
-
+[TODO]
 
 
 ### Model optimizations
 
-
-
 ## Misuse and Out-of-scope Use
 
 SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
@@ -82,11 +80,11 @@ We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
-SmolVLM2 used 3.3M samples for training originally from ten different datasets: : LlaVa Onevision, M4-Instruct, Mammoth, LlaVa Video 178K, FineVideo, VideoStar, VRipt, Vista-400K, MovieChat and ShareGPT4Video.
+SmolVLM2 used 3.3M samples for training originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
 In the following plots we give a general overview of the samples across modalities and the source of those samples.
 
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
-### Detailed videw
+### Details
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
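
The transformers code example is still marked `[TODO]` in this revision of the README. As a hedged sketch of the usage the card describes (loading the model with transformers and running video inference), it could look roughly like the snippet below; the checkpoint id `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`, the bfloat16/CUDA settings, the local video path, and a recent transformers release with SmolVLM2 video support are all assumptions, not content from this commit.

```python
# Hypothetical sketch, not part of this commit. Assumes the
# HuggingFaceTB/SmolVLM2-500M-Video-Instruct checkpoint and a recent
# transformers release that supports SmolVLM2 with video inputs.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to("cuda")  # or "cpu"; the 500M model is small enough for modest hardware

# One user turn interleaving a video with a text query.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},  # local path, assumed to exist
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

# The processor's chat template handles frame sampling and tokenization.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Since text and media entries can be interleaved arbitrarily in the `content` list, the same call pattern should also cover image tasks such as captioning or visual question answering: swap the video entry for one or more `{"type": "image", "url": ...}` entries mixed with text entries.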