alexmarques committed
Commit 3037bb5 · verified · 1 Parent(s): 0fd0781

Update README.md

Files changed (1)
  1. README.md +178 -22
README.md CHANGED
@@ -32,8 +32,9 @@ base_model: meta-llama/Meta-Llama-3.1-405B-Instruct
32
  - **License(s):** Llama3.1
33
  - **Model Developers:** Neural Magic
34
 
35
- Quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
36
- It achieves scores within 1.3% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.
 
37
 
38
  ### Model Optimizations
39
 
@@ -142,18 +143,26 @@ oneshot(
142
  model.save_pretrained("Meta-Llama-3.1-405B-Instruct-quantized.w8a8")
143
  ```
144
 
145
-
146
  ## Evaluation
147
 
148
- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
149
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
150
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-405B-Instruct-evals).
151
 
152
  **Note:** Results have been updated after Meta modified the chat template.
153
 
154
  ### Accuracy
155
 
156
- #### Open LLM Leaderboard evaluation scores
157
  <table>
158
  <tr>
159
  <td><strong>Benchmark</strong>
@@ -165,12 +174,26 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
165
  <td><strong>Recovery</strong>
166
  </td>
167
  </tr>
168
  <tr>
169
  <td>MMLU (5-shot)
170
  </td>
171
- <td>87.38
172
  </td>
173
- <td>87.05
174
  </td>
175
  <td>99.6%
176
  </td>
@@ -178,9 +201,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
178
  <tr>
179
  <td>ARC Challenge (0-shot)
180
  </td>
181
- <td>94.97
182
  </td>
183
- <td>94.37
184
  </td>
185
  <td>99.4%
186
  </td>
@@ -188,9 +211,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
188
  <tr>
189
  <td>GSM-8K (CoT, 8-shot, strict-match)
190
  </td>
191
- <td>96.44
192
  </td>
193
- <td>95.45
194
  </td>
195
  <td>99.0%
196
  </td>
@@ -198,9 +221,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
198
  <tr>
199
  <td>Hellaswag (10-shot)
200
  </td>
201
- <td>88.33
202
  </td>
203
- <td>88.15
204
  </td>
205
  <td>99.8%
206
  </td>
@@ -208,9 +231,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
208
  <tr>
209
  <td>Winogrande (5-shot)
210
  </td>
211
- <td>87.21
212
  </td>
213
- <td>86.11
214
  </td>
215
  <td>98.7%
216
  </td>
@@ -218,9 +241,9 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
218
  <tr>
219
  <td>TruthfulQA (0-shot)
220
  </td>
221
- <td>64.64
222
  </td>
223
- <td>64.39
224
  </td>
225
  <td>99.6%
226
  </td>
@@ -228,13 +251,111 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
228
  <tr>
229
  <td><strong>Average</strong>
230
  </td>
231
- <td><strong>86.75</strong>
232
  </td>
233
- <td><strong>86.15</strong>
234
  </td>
235
  <td><strong>99.3%</strong>
236
  </td>
237
  </tr>
238
  </table>
239
 
240
  ### Reproduction
@@ -315,4 +436,39 @@ lm_eval \
315
  --tasks truthfulqa \
316
  --num_fewshot 0 \
317
  --batch_size auto
318
- ```
32
  - **License(s):** Llama3.1
33
  - **Model Developers:** Neural Magic
34
 
35
+ This model is a quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
36
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice question answering, math reasoning, and open-ended text generation.
37
+ Meta-Llama-3.1-405B-Instruct-quantized.w8a8 achieves 95.8% recovery for the Arena-Hard evaluation, 99.3% for OpenLLM v1 (using Meta's prompting when available), 98.4% for OpenLLM v2, 100.1% for HumanEval pass@1, and 100.4% for HumanEval+ pass@1.
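
For reference, the recovery figures quoted here and in the table below are simply the quantized model's score expressed as a percentage of the unquantized baseline's score. A minimal illustrative sketch (the `recovery` helper is not part of this repository):

```
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return 100.0 * quantized_score / baseline_score

# Example with the rounded MMLU (5-shot) scores from the table below.
print(f"{recovery(87.1, 87.4):.1f}%")  # ~99.7%; the table reports 99.6%, computed from unrounded scores
```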
38
 
39
  ### Model Optimizations
40
 
 
143
  model.save_pretrained("Meta-Llama-3.1-405B-Instruct-quantized.w8a8")
144
  ```
145
 
 
146
  ## Evaluation
147
 
148
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
149
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
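
As context, below is a minimal sketch of generating a chat completion offline with vLLM. The parallelism and context-length settings mirror the evaluation commands further down; the sampling parameters are illustrative assumptions, not the exact settings used for the evaluations, and a model of this size requires a multi-GPU node.

```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8"

# Build a chat-formatted prompt with the Llama 3.1 chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Same tensor parallelism and max_model_len as the evaluation commands below.
llm = LLM(model=model_id, tensor_parallel_size=8, max_model_len=4096)
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)  # illustrative sampling settings
print(llm.generate([prompt], sampling)[0].outputs[0].text)
```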
150
+
151
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
152
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
153
+ We report below the scores obtained in each judgement and the average.
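
In the table below, the two per-judgement scores appear in parentheses and the headline Arena-Hard number is their mean; a trivial illustration using the rounded values:

```
from statistics import mean

# Two GPT-4 judgements per set of answers; the reported Arena-Hard score is their mean.
baseline_judgements = (67.3, 67.5)    # unquantized Meta-Llama-3.1-405B-Instruct
quantized_judgements = (64.3, 64.8)   # this quantized model
print(mean(baseline_judgements))      # 67.4
print(mean(quantized_judgements))     # ~64.55; the card reports 64.6, rounded from unrounded judgements
```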
154
+
155
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
156
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-405B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
157
+
158
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
159
+
160
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
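
These generations can be browsed directly with the `datasets` library. A small sketch, assuming the Arena-Hard dataset loads with its default configuration; check the dataset card for the exact configs, splits, and column names:

```
from datasets import load_dataset

# Published per-prompt generations for the Arena-Hard evaluation.
ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
print(ds)                              # available splits and row counts
first_split = next(iter(ds.values()))  # peek at one record to see the schema
print(first_split[0])
```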
161
 
162
  **Note:** Results have been updated after Meta modified the chat template.
163
 
164
  ### Accuracy
165
 
 
166
  <table>
167
  <tr>
168
  <td><strong>Benchmark</strong>
 
174
  <td><strong>Recovery</strong>
175
  </td>
176
  </tr>
177
+ <tr>
178
+ <td><strong>Arena Hard</strong>
179
+ </td>
180
+ <td>67.4 (67.3 / 67.5)
181
+ </td>
182
+ <td>64.6 (64.3 / 64.8)
183
+ </td>
184
+ <td>95.8%
185
+ </td>
186
+ </tr>
187
+ <tr>
188
+ <td><strong>OpenLLM v1</strong>
189
+ </td>
190
+ </tr>
191
  <tr>
192
  <td>MMLU (5-shot)
193
  </td>
194
+ <td>87.4
195
  </td>
196
+ <td>87.1
197
  </td>
198
  <td>99.6%
199
  </td>
 
201
  <tr>
202
  <td>ARC Challenge (0-shot)
203
  </td>
204
+ <td>95.0
205
  </td>
206
+ <td>94.4
207
  </td>
208
  <td>99.4%
209
  </td>
 
211
  <tr>
212
  <td>GSM-8K (CoT, 8-shot, strict-match)
213
  </td>
214
+ <td>96.4
215
  </td>
216
+ <td>95.5
217
  </td>
218
  <td>99.0%
219
  </td>
 
221
  <tr>
222
  <td>Hellaswag (10-shot)
223
  </td>
224
+ <td>88.3
225
  </td>
226
+ <td>88.2
227
  </td>
228
  <td>99.8%
229
  </td>
 
231
  <tr>
232
  <td>Winogrande (5-shot)
233
  </td>
234
+ <td>87.2
235
  </td>
236
+ <td>86.1
237
  </td>
238
  <td>98.7%
239
  </td>
 
241
  <tr>
242
  <td>TruthfulQA (0-shot)
243
  </td>
244
+ <td>64.6
245
  </td>
246
+ <td>64.4
247
  </td>
248
  <td>99.6%
249
  </td>
 
251
  <tr>
252
  <td><strong>Average</strong>
253
  </td>
254
+ <td><strong>86.8</strong>
255
  </td>
256
+ <td><strong>86.2</strong>
257
  </td>
258
  <td><strong>99.3%</strong>
259
  </td>
260
  </tr>
261
+ <tr>
262
+ <td><strong>OpenLLM v2</strong>
263
+ </td>
264
+ </tr>
265
+ <tr>
266
+ <td>MMLU-Pro (5-shot)
267
+ </td>
268
+ <td>59.7
269
+ </td>
270
+ <td>58.4
271
+ </td>
272
+ <td>97.8%
273
+ </td>
274
+ </tr>
275
+ <tr>
276
+ <td>IFEval (0-shot)
277
+ </td>
278
+ <td>87.7
279
+ </td>
280
+ <td>87.0
281
+ </td>
282
+ <td>99.2%
283
+ </td>
284
+ </tr>
285
+ <tr>
286
+ <td>BBH (3-shot)
287
+ </td>
288
+ <td>67.0
289
+ </td>
290
+ <td>66.7
291
+ </td>
292
+ <td>99.6%
293
+ </td>
294
+ </tr>
295
+ <tr>
296
+ <td>Math-lvl-5 (4-shot)
297
+ </td>
298
+ <td>39.0
299
+ </td>
300
+ <td>35.8
301
+ </td>
302
+ <td>91.9%
303
+ </td>
304
+ </tr>
305
+ <tr>
306
+ <td>GPQA (0-shot)
307
+ </td>
308
+ <td>19.5
309
+ </td>
310
+ <td>20.4
311
+ </td>
312
+ <td>104.5%
313
+ </td>
314
+ </tr>
315
+ <tr>
316
+ <td>MuSR (0-shot)
317
+ </td>
318
+ <td>19.5
319
+ </td>
320
+ <td>19.2
321
+ </td>
322
+ <td>98.8%
323
+ </td>
324
+ </tr>
325
+ <tr>
326
+ <td><strong>Average</strong>
327
+ </td>
328
+ <td><strong>48.7</strong>
329
+ </td>
330
+ <td><strong>47.9</strong>
331
+ </td>
332
+ <td><strong>98.4%</strong>
333
+ </td>
334
+ </tr>
335
+ <tr>
336
+ <td><strong>Coding</strong>
337
+ </td>
338
+ </tr>
339
+ <tr>
340
+ <td>HumanEval pass@1
341
+ </td>
342
+ <td>86.8
343
+ </td>
344
+ <td>86.9
345
+ </td>
346
+ <td>100.1%
347
+ </td>
348
+ </tr>
349
+ <tr>
350
+ <td>HumanEval+ pass@1
351
+ </td>
352
+ <td>80.1
353
+ </td>
354
+ <td>80.4
355
+ </td>
356
+ <td>100.4%
357
+ </td>
358
+ </tr>
359
  </table>
360
 
361
  ### Reproduction
 
436
  --tasks truthfulqa \
437
  --num_fewshot 0 \
438
  --batch_size auto
439
+ ```
440
+
441
+ #### OpenLLM v2
442
+ ```
443
+ lm_eval \
444
+ --model vllm \
445
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8",dtype=auto,max_model_len=4096,tensor_parallel_size=8,enable_chunked_prefill=True \
446
+ --apply_chat_template \
447
+ --fewshot_as_multiturn \
448
+ --tasks leaderboard \
449
+ --batch_size auto
450
+ ```
451
+
452
+ #### HumanEval and HumanEval+
453
+ ##### Generation
454
+ ```
455
+ python3 codegen/generate.py \
456
+ --model neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 \
457
+ --bs 16 \
458
+ --temperature 0.2 \
459
+ --n_samples 50 \
460
+ --root "." \
461
+ --dataset humaneval \
462
+ --tp 8
463
+ ```
464
+ ##### Sanitization
465
+ ```
466
+ python3 evalplus/sanitize.py \
467
+ humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-quantized.w8a8_vllm_temp_0.2
468
+ ```
469
+ ##### Evaluation
470
+ ```
471
+ evalplus.evaluate \
472
+ --dataset humaneval \
473
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-405B-Instruct-quantized.w8a8_vllm_temp_0.2-sanitized
474
+ ```
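
EvalPlus reports pass@1 on its own; for reference, the standard unbiased pass@k estimator used when multiple samples are generated per problem (here 50 samples at temperature 0.2) is sketched below. This illustrates the metric, not the exact EvalPlus implementation.

```
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 50 samples for one problem, 43 passing -> pass@1 estimate of 0.86
print(pass_at_k(50, 43, 1))
```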