---
tags:
- vllm
- vision
- audio
- fp8
license: mit
base_model: google/gemma-3n-E4B-it
library_name: transformers
---

# RedHatAI/gemma-3n-E4B-it-FP8-Dynamic

## Model Overview
- **Model Architecture:** gemma-3n-E4B-it
  - **Input:** Audio-Vision-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 08/01/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

Quantized version of [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it).

### Model Optimizations

This model was obtained by quantizing the weights of [google/gemma-3n-E4B-it](https://huggingface.co/google/gemma-3n-E4B-it) to the FP8 data type, making it ready for inference with vLLM >= 0.10.0.

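With the FP8_DYNAMIC scheme, weight scales are fixed when the checkpoint is produced, while activation scales are computed on the fly, one per token, at inference time. The snippet below is a minimal, illustrative sketch of that dynamic activation step, not the fused kernel vLLM actually runs; the only fixed constant is 448, the largest finite value of the `float8_e4m3fn` format.

```python
import torch

FP8_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def fp8_dynamic_quantize(x: torch.Tensor):
    """Illustrative per-token dynamic FP8 quantization (sketch, not vLLM's kernel)."""
    # One scale per token (row): map each row's max magnitude onto the FP8 range.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax
    x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Quantize a small batch of activations and check the round-trip error.
x = torch.randn(2, 8)
x_fp8, scale = fp8_dynamic_quantize(x)
x_roundtrip = x_fp8.to(torch.float32) / scale
print((x - x_roundtrip).abs().max())
```
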
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="RedHatAI/gemma-3n-E4B-it-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs (Gemma-style chat turns; the image placeholder token is
# assumed to be <image_soft_token>, check the model's chat template if
# generation looks off)
question = "What is the content of this image?"
inputs = {
    "prompt": f"<start_of_turn>user\n<image_soft_token>{question}<end_of_turn>\n<start_of_turn>model\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

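As a minimal sketch of that serving path (the port below is vLLM's default 8000, and the image URL is only a placeholder), the model can be launched with `vllm serve` and queried with the standard OpenAI client:

```
vllm serve RedHatAI/gemma-3n-E4B-it-FP8-Dynamic --max-model-len 4096
```

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/gemma-3n-E4B-it-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the content of this image?"},
            # Placeholder URL; replace with a real, reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
        ],
    }],
    temperature=0.2,
    max_tokens=64,
)
print(response.choices[0].message.content)
```
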
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

# Load the model and processor.
model_id = "google/gemma-3n-E4B-it"
model = Gemma3nForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Recipe: quantize every Linear layer to FP8 with dynamic activation scales,
# skipping the audio/vision towers, embeddings, and Gemma-3n-specific projections.
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=[
            "re:.*embed_audio.*",
            "re:.*embed_vision.*",
            "re:.*audio_tower.*",
            "re:.*vision_tower.*",
            "re:.*altup.*",
            "re:.*lm_head.*",
            "re:.*laurel.*",
            r"re:model\.language_model\.layers\.\d+\.per_layer_input_gate",
            r"re:model\.language_model\.layers\.\d+\.per_layer_projection",
            "model.language_model.per_layer_model_projection",
        ],
    ),
]

SAVE_DIR = f"{model_id.split('/')[1]}-{recipe[0].scheme}"

# Apply the recipe in one shot (no calibration data is needed for FP8_DYNAMIC).
oneshot(
    model=model,
    tokenizer=model_id,
    recipe=recipe,
    trust_remote_code_model=True,
    tie_word_embeddings=True,
    output_dir=SAVE_DIR,
)

# Save the compressed model and processor to disk.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
</details>

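To sanity-check which modules were skipped, you can inspect the quantization metadata that llm-compressor writes into the checkpoint's `config.json`. The sketch below assumes that metadata sits under a `quantization_config` key with an `ignore` list, as compressed-tensors checkpoints currently do; the exact layout may change between releases.

```python
import json
from pathlib import Path

# SAVE_DIR from the creation script above.
cfg = json.loads(Path("gemma-3n-E4B-it-FP8_DYNAMIC/config.json").read_text())

qcfg = cfg.get("quantization_config", {})
print(sorted(qcfg.keys()))         # top-level quantization metadata fields
print(qcfg.get("ignore", [])[:5])  # first few module patterns left unquantized
```
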
## Evaluation

The model was evaluated on the OpenLLM V1 and Leaderboard V2 text benchmarks with [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness), using the vLLM backend. The evaluations were run with the following commands:

<details>
<summary>Evaluation Commands</summary>

### OpenLLM V1

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=4096,gpu_memory_utilization=0.8,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks openllm \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn
```

### Leaderboard V2

```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=false,max_model_len=15000,gpu_memory_utilization=0.5,enable_chunked_prefill=True,enforce_eager=True,trust_remote_code=True \
  --tasks leaderboard \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn
```
</details>

### Accuracy

<table>
<thead>
<tr>
<th>Category</th>
<th>Metric</th>
<th>google/gemma-3n-E4B-it</th>
<th>FP8 Dynamic</th>
<th>Recovery (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="7"><b>OpenLLM V1</b></td>
<td>arc_challenge</td>
<td>60.24</td>
<td>59.04</td>
<td>98.01%</td>
</tr>
<tr>
<td>gsm8k</td>
<td>60.12</td>
<td>70.81</td>
<td>117.79%</td>
</tr>
<tr>
<td>hellaswag</td>
<td>74.94</td>
<td>73.28</td>
<td>97.79%</td>
</tr>
<tr>
<td>mmlu</td>
<td>64.14</td>
<td>64.82</td>
<td>101.06%</td>
</tr>
<tr>
<td>truthfulqa_mc2</td>
<td>54.87</td>
<td>54.61</td>
<td>99.53%</td>
</tr>
<tr>
<td>winogrande</td>
<td>68.35</td>
<td>67.72</td>
<td>99.08%</td>
</tr>
<tr>
<td><b>Average</b></td>
<td>63.78</td>
<td>65.05</td>
<td><b>101.99%</b></td>
</tr>
<tr>
<td rowspan="7"><b>Leaderboard</b></td>
<td>bbh</td>
<td>55.46</td>
<td>55.20</td>
<td>99.53%</td>
</tr>
<tr>
<td>mmlu_pro</td>
<td>34.38</td>
<td>34.28</td>
<td>99.71%</td>
</tr>
<tr>
<td>musr</td>
<td>33.20</td>
<td>34.26</td>
<td>103.19%</td>
</tr>
<tr>
<td>ifeval</td>
<td>84.41</td>
<td>83.93</td>
<td>99.43%</td>
</tr>
<tr>
<td>gpqa</td>
<td>30.87</td>
<td>31.38</td>
<td>101.65%</td>
</tr>
<tr>
<td>math_hard</td>
<td>45.54</td>
<td>46.60</td>
<td>102.33%</td>
</tr>
<tr>
<td><b>Average</b></td>
<td>47.31</td>
<td>47.61</td>
<td><b>100.63%</b></td>
</tr>
</tbody>
</table>
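
The Recovery column is the quantized score expressed as a percentage of the baseline score, and the Average rows are unweighted means over the individual benchmarks. A small sketch of that arithmetic, using the OpenLLM V1 numbers from the table above:

```python
# Recovery (%) = FP8 score / baseline score * 100; averages are unweighted means.
baseline = {"arc_challenge": 60.24, "gsm8k": 60.12, "hellaswag": 74.94,
            "mmlu": 64.14, "truthfulqa_mc2": 54.87, "winogrande": 68.35}
fp8 = {"arc_challenge": 59.04, "gsm8k": 70.81, "hellaswag": 73.28,
       "mmlu": 64.82, "truthfulqa_mc2": 54.61, "winogrande": 67.72}

for task, base_score in baseline.items():
    print(f"{task:16s} recovery = {fp8[task] / base_score * 100:.2f}%")

base_avg = sum(baseline.values()) / len(baseline)
fp8_avg = sum(fp8.values()) / len(fp8)
print(f"{'average':16s} recovery = {fp8_avg / base_avg * 100:.2f}%")  # ~101.99%
```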