---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
---

# Qwen3-8B-quantized.w4a16

## Model Overview
- **Model Architecture:** Qwen3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Intended Use Cases:**
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 05/05/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized.
Weights are quantized using an asymmetric per-group scheme, with group size 64.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
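
For illustration, the arithmetic of this format can be sketched in a few lines of NumPy. This is a simplified, hypothetical sketch of asymmetric per-group round-to-nearest quantization, not the GPTQ/llm-compressor implementation (GPTQ additionally uses calibration data to minimize the error in each layer's output):

```python
import numpy as np

def quantize_per_group(w, group_size=64, bits=4):
    # Asymmetric per-group quantization: one scale and zero point per group of 64 weights.
    n_levels = 2**bits - 1  # INT4 -> 15 quantization steps
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / n_levels
    zero_point = np.round(-w_min / scale)  # asymmetric: per-group offset
    q = np.clip(np.round(w / scale) + zero_point, 0, n_levels)
    return q, scale, zero_point

def dequantize_per_group(q, scale, zero_point):
    # Reconstruction applied (implicitly) at inference time.
    return (q - zero_point) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale, zero_point = quantize_per_group(w)
w_hat = dequantize_per_group(q, scale, zero_point).reshape(w.shape)
print("max reconstruction error:", np.abs(w - w_hat).max())
```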

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-8B-quantized.w4a16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
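
As a minimal sketch of that workflow (the default base URL and port are assumptions to adjust for your setup), the model can be served and then queried with the standard OpenAI client:

```
vllm serve RedHatAI/Qwen3-8B-quantized.w4a16
```

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (default base URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen3-8B-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)
```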

## Creation

<details>
<summary>Creation details</summary>
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_stub = "Qwen/Qwen3-8B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Load and preprocess the calibration dataset
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["lm_head"],
    sequential_targets=["Qwen3DecoderLayer"],
    targets="Linear",
    dampening_frac=0.01,
    scheme="W4A16",
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
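
After the snippet runs, the saved checkpoint can be sanity-checked before deployment. This is an optional, hedged sketch (the local path is the `save_path` from the snippet above); the compressed-tensors format records the scheme in `config.json`:

```python
from transformers import AutoConfig

# Expect a compressed-tensors quantization config with 4-bit weights and group size 64.
config = AutoConfig.from_pretrained("Qwen3-8B-quantized.w4a16")
print(config.quantization_config)
```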

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

**lm-evaluation-harness**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-8B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**lighteval**

lighteval_model_arguments.yaml
```yaml
model_parameters:
  model_name: RedHatAI/Qwen3-8B-quantized.w4a16
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template
```

</details>

### Accuracy

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>Qwen3-8B</th>
    <th>Qwen3-8B-quantized.w4a16<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v1</strong></td>
    <td>MMLU (5-shot)</td>
    <td>71.95</td>
    <td>69.74</td>
    <td>96.9%</td>
  </tr>
  <tr>
    <td>ARC Challenge (25-shot)</td>
    <td>61.69</td>
    <td>61.77</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>GSM-8K (5-shot, strict-match)</td>
    <td>75.97</td>
    <td>78.62</td>
    <td>103.5%</td>
  </tr>
  <tr>
    <td>Hellaswag (10-shot)</td>
    <td>56.52</td>
    <td>57.79</td>
    <td>102.2%</td>
  </tr>
  <tr>
    <td>Winogrande (5-shot)</td>
    <td>65.98</td>
    <td>66.22</td>
    <td>100.4%</td>
  </tr>
  <tr>
    <td>TruthfulQA (0-shot, mc2)</td>
    <td>53.17</td>
    <td>53.71</td>
    <td>101.0%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>64.21</strong></td>
    <td><strong>64.64</strong></td>
    <td><strong>100.7%</strong></td>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v2</strong></td>
    <td>MMLU-Pro (5-shot)</td>
    <td>34.57</td>
    <td>25.71</td>
    <td>74.4%</td>
  </tr>
  <tr>
    <td>IFEval (0-shot)</td>
    <td>84.77</td>
    <td>85.44</td>
    <td>100.8%</td>
  </tr>
  <tr>
    <td>BBH (3-shot)</td>
    <td>25.47</td>
    <td>21.17</td>
    <td>83.1%</td>
  </tr>
  <tr>
    <td>Math-lvl-5 (4-shot)</td>
    <td>51.05</td>
    <td>51.38</td>
    <td>100.7%</td>
  </tr>
  <tr>
    <td>GPQA (0-shot)</td>
    <td>0.00</td>
    <td>0.00</td>
    <td>---</td>
  </tr>
  <tr>
    <td>MuSR (0-shot)</td>
    <td>10.02</td>
    <td>9.31</td>
    <td>---</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>34.26</strong></td>
    <td><strong>33.46</strong></td>
    <td><strong>97.7%</strong></td>
  </tr>
  <tr>
    <td><strong>Multilingual</strong></td>
    <td>MGSM (0-shot)</td>
    <td>25.97</td>
    <td>24.73</td>
    <td>95.3%</td>
  </tr>
  <tr>
    <td rowspan="5"><strong>Reasoning<br>(generation)</strong></td>
    <td>AIME 2024</td>
    <td>74.58</td>
    <td>74.17</td>
    <td>99.5%</td>
  </tr>
  <tr>
    <td>AIME 2025</td>
    <td>65.21</td>
    <td>61.98</td>
    <td>95.1%</td>
  </tr>
  <tr>
    <td>GPQA diamond</td>
    <td>58.59</td>
    <td>55.56</td>
    <td>94.8%</td>
  </tr>
  <tr>
    <td>Math-lvl-5</td>
    <td>97.60</td>
    <td>96.20</td>
    <td>98.6%</td>
  </tr>
  <tr>
    <td>LiveCodeBench</td>
    <td>56.27</td>
    <td>52.29</td>
    <td>92.9%</td>
  </tr>
</table>
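
Recovery in the table above is the quantized model's score expressed as a percentage of the baseline model's score. A minimal sketch of the computation (the helper name is illustrative):

```python
def recovery(quantized_score, baseline_score):
    # Quantized score as a percentage of the unquantized baseline.
    return 100.0 * quantized_score / baseline_score

print(f"{recovery(69.74, 71.95):.1f}%")  # MMLU (5-shot) row above -> 96.9%
```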