alexmarques committed (verified)
Commit 3ecb94d · Parent(s): b8c4487

Update README.md

Files changed (1): README.md (+118 -5)

README.md CHANGED
@@ -31,8 +31,9 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
  - **License(s):** Llama3.1
  - **Model Developers:** Neural Magic

- Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
- It achieves scores within 1.4% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande, and TruthfulQA.
+ This model is a quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
+ Meta-Llama-3.1-70B-Instruct-quantized.w4a16 achieves 100.0% recovery for the Arena-Hard evaluation, 99.4% for OpenLLM v1 (using Meta's prompting when available), 97.4% for OpenLLM v2, 101.0% for HumanEval pass@1, and 99.2% for HumanEval+ pass@1.

  ### Model Optimizations

@@ -131,15 +132,15 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w4a16")

  ## Evaluation

- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, and HumanEval benchmarks.
  Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals), as well as a few fixes to OpenLLM v2 tasks.

  **Note:** Results have been updated after Meta modified the chat template.

  ### Accuracy

- #### Open LLM Leaderboard evaluation scores
+
  <table>
  <tr>
  <td><strong>Benchmark</strong>
@@ -151,6 +152,20 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <td><strong>Recovery</strong>
  </td>
  </tr>
+ <tr>
+ <td><strong>Arena Hard</strong>
+ </td>
+ <td>57.0 (55.8 / 58.2)
+ </td>
+ <td>57.0 (57.1 / 56.8)
+ </td>
+ <td>100.0%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>OpenLLM v1</strong>
+ </td>
+ </tr>
  <tr>
  <td>MMLU (5-shot)
  </td>
@@ -231,6 +246,104 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <td><strong>99.4%</strong>
  </td>
  </tr>
+ <tr>
+ <td><strong>OpenLLM v2</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro
+ </td>
+ <td>48.12
+ </td>
+ <td>47.25
+ </td>
+ <td>98.2%
+ </td>
+ </tr>
+ <tr>
+ <td>IFEval
+ </td>
+ <td>86.41
+ </td>
+ <td>85.74
+ </td>
+ <td>99.2%
+ </td>
+ </tr>
+ <tr>
+ <td>BBH
+ </td>
+ <td>55.79
+ </td>
+ <td>55.01
+ </td>
+ <td>98.6%
+ </td>
+ </tr>
+ <tr>
+ <td>Math lvl 5
+ </td>
+ <td>26.07
+ </td>
+ <td>24.38
+ </td>
+ <td>93.5%
+ </td>
+ </tr>
+ <tr>
+ <td>GPQA ()
+ </td>
+ <td>15.40
+ </td>
+ <td>13.85
+ </td>
+ <td>89.9%
+ </td>
+ </tr>
+ <tr>
+ <td>MuSR (5-shot)
+ </td>
+ <td>18.16
+ </td>
+ <td>17.25
+ </td>
+ <td>95.0%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Average</strong>
+ </td>
+ <td><strong>41.7</strong>
+ </td>
+ <td><strong>40.6</strong>
+ </td>
+ <td><strong>97.4%</strong>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Coding</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval pass@1
+ </td>
+ <td>79.7
+ </td>
+ <td>80.5
+ </td>
+ <td>101.0%
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval+ pass@1
+ </td>
+ <td>74.8
+ </td>
+ <td>74.2
+ </td>
+ <td>99.2%
+ </td>
+ </tr>
  </table>

  ### Reproduction
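
The evaluation setup referenced in the hunks above (the llama_3.1_instruct branch of lm-evaluation-harness driving the vLLM engine) can be exercised through the harness's Python entry point. The sketch below is illustrative only: the repository id, tensor-parallel size, and the stock `mmlu` task name (rather than the fork's Meta-style prompt variants) are assumptions, and the exact commands belong to the README's Reproduction section rather than to this diff.

```python
# Minimal sketch: score the quantized checkpoint with lm-evaluation-harness on the vLLM backend.
# Assumptions for illustration: the neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 repo id,
# 4-way tensor parallelism, and the stock "mmlu" task; the fork's Meta-style task variants may differ.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16,"
        "dtype=auto,max_model_len=4096,tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["mmlu"],      # reported in the table as 5-shot
    num_fewshot=5,
    batch_size="auto",
)

# Per-task metrics come back as a nested dictionary.
print(results["results"]["mmlu"])
```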
 
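
The recovery column added throughout the table is the quantized model's score expressed as a percentage of the unquantized baseline's score. A small worked check against three of the quoted numbers; the helper below is illustrative, not code from the model card:

```python
# Recovery = quantized score / unquantized baseline score, reported as a percentage.
# The function name is illustrative; the numbers are taken from the table above.
def recovery(quantized: float, baseline: float) -> float:
    return 100.0 * quantized / baseline

print(f"Arena Hard:         {recovery(57.0, 57.0):.1f}%")  # 100.0%
print(f"OpenLLM v2 average: {recovery(40.6, 41.7):.1f}%")  # 97.4%
print(f"HumanEval pass@1:   {recovery(80.5, 79.7):.1f}%")  # 101.0%
```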