merve (HF staff) committed · verified · Commit 155b0bd · Parent(s): 4da1f23

Added snippets

Files changed (1): README.md +112 -3

README.md CHANGED
@@ -60,9 +60,118 @@ We evaluated the performance of the SmolVLM2 family on the following scientific
 
 ### How to get started
 
- You can use transformers to load, infer and fine-tune SmolVLM.
-
- [TODO]
+ You can use transformers to load, run inference with, and fine-tune SmolVLM2. Make sure you have num2words and flash-attn installed, along with the latest transformers (for example, `pip install -U transformers num2words flash-attn`).
+ You can load the model as follows.
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ # Example checkpoint id; replace with the SmolVLM2 variant you are using.
+ model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
+
+ processor = AutoProcessor.from_pretrained(model_path)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     _attn_implementation="flash_attention_2",
+ ).to("cuda")
+ ```
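+
+ If flash-attn is not available in your environment, a minimal fallback sketch is below, assuming the checkpoint also runs with the default attention implementation picked by transformers (this is our assumption, not an official recommendation).
+
+ ```python
+ # Hedged sketch: same load call without flash-attention 2.
+ # Reuses model_path and torch from the snippet above.
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+ ).to("cuda")
+ ```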
+
+ #### Simple Inference
+
+ You can preprocess your inputs directly with the chat template and pass the result straight to the model.
+
+ ```python
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "What is in this image?"},
+             {"type": "image", "path": "path_to_img.png"},
+         ]
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+ generated_texts = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+ )
+ print(generated_texts[0])
+ ```
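+
+ If you don't have a local file, the chat template can likely also fetch images by URL, as in the sketch below (assumption: the processor supports the `"url"` content key, as in other recent transformers vision models; the URL is a placeholder).
+
+ ```python
+ # Hedged sketch: reference the image by URL instead of a local path.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "What is in this image?"},
+             {"type": "image", "url": "https://example.com/some_image.png"},  # placeholder URL
+         ]
+     },
+ ]
+ # The rest of the pipeline (apply_chat_template, generate, batch_decode) is unchanged.
+ ```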
+
+ #### Video Inference
+
+ To use SmolVLM2 for video inference, make sure you have decord installed (for example, `pip install decord`).
+
+ ```python
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "path": "path_to_video.mp4"},
+             {"type": "text", "text": "Describe this video in detail"}
+         ]
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+ generated_texts = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+ )
+
+ print(generated_texts[0])
+ ```
+
+ #### Multi-image Interleaved Inference
+
+ You can interleave multiple images with text using chat templates.
+
+ ```python
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "What is the similarity between this image <image>"},
+             {"type": "image", "path": "image_1.png"},
+             {"type": "text", "text": "and this image <image>"},
+             {"type": "image", "path": "image_2.png"},
+         ]
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+ generated_texts = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+ )
+ print(generated_texts[0])
+ ```
 
 
  ### Model optimizations