HuggingFaceTB/SmolVLM2-500M-Video-Instruct

merve (HF staff) committed 6 days ago

Commit 155b0bd · verified · 1 Parent(s): 4da1f23

Added snippets

Files changed (1)

README.md +112 -3


@@ -60,9 +60,118 @@ We evaluated the performance of the SmolVLM2 family on the following scientific
 ### How to get started
-You can use transformers to load, infer and fine-tune SmolVLM.
-[TODO]
 ### Model optimizations

 ### How to get started
+You can use transformers to load, run inference with, and fine-tune SmolVLM. Make sure you have num2words, flash-attn, and the latest version of transformers installed.
+You can load the model as follows.
+```python
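+import torch
+from transformers import AutoProcessor, AutoModelForImageTextToText
+# Hub ID of this repository; a local checkpoint directory also works.
+model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
+processor = AutoProcessor.from_pretrained(model_path)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_path,
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",  # needs flash-attn and a supported GPU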
+).to("cuda")
+```
+#### Simple Inference
+You can preprocess your inputs with the chat template and pass the result directly to the model.
+```python
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What is in this image?"},
+            {"type": "image", "path": "path_to_img.png"},
+        ]
+    },
+]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+generated_texts = processor.batch_decode(
+    generated_ids,
+    skip_special_tokens=True,
+)
+print(generated_texts[0])
+```
+#### Video Inference
+To use SmolVLM2 for video inference, make sure you have decord installed.
+```python
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "video", "path": "path_to_video.mp4"},
+            {"type": "text", "text": "Describe this video in detail"}
+        ]
+    },
+]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+generated_texts = processor.batch_decode(
+    generated_ids,
+    skip_special_tokens=True,
+)
+print(generated_texts[0])
+```
+#### Multi-image Interleaved Inference
+You can interleave multiple media with text using chat templates.
+```python
+import torch
+messages = [
+    {
+        "role": "user",
+        "content": [
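+            # The chat template inserts the image placeholder for each image entry,
+            # so the text parts do not need a literal <image> token.
+            {"type": "text", "text": "What is the similarity between this image"},
+            {"type": "image", "path": "image_1.png"},
+            {"type": "text", "text": "and this image?"},
+            {"type": "image", "path": "image_2.png"},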
+        ]
+    },
+]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+generated_texts = processor.batch_decode(
+    generated_ids,
+    skip_special_tokens=True,
+)
+print(generated_texts[0])
+```
 ### Model optimizations
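
Note on the added snippets: they assume num2words, flash-attn, decord and a recent transformers are installed. The sketch below is not part of this commit; it is one possible way to check for them up front and to fall back to PyTorch's SDPA attention when flash-attn is absent. The helper names `missing_modules` and `pick_attn_implementation` are hypothetical, and the check only looks at whether a module can be found, not at whether the GPU actually supports FlashAttention-2.

```python
import importlib.util

# Import names of the dependencies mentioned in the README text.
# The pip package "flash-attn" is imported as "flash_attn".
REQUIRED_MODULES = ["transformers", "num2words", "flash_attn", "decord"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn can be found, otherwise "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
    print("attn_implementation:", pick_attn_implementation())
```

The returned string could then be passed as the attention implementation argument when loading the model, so the loading snippet still runs on machines without flash-attn.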