alexmarques committed (verified)
Commit 3ecb94d · Parent(s): b8c4487

Update README.md

Files changed (1): README.md (+118 -5)

README.md CHANGED
@@ -31,8 +31,9 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
  - **License(s):** Llama3.1
  - **Model Developers:** Neural Magic

- Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
- It achieves scores within 1.4% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande, and TruthfulQA.
+ This model is a quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation.
+ Meta-Llama-3.1-70B-Instruct-quantized.w4a16 achieves 100.0% recovery for the Arena-Hard evaluation, 99.4% for OpenLLM v1 (using Meta's prompting when available), 97.4% for OpenLLM v2, 101.0% for HumanEval pass@1, and 99.2% for HumanEval+ pass@1.

  ### Model Optimizations

@@ -131,15 +132,15 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w4a16")

  ## Evaluation

- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, and HumanEval benchmarks.
  Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
- This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals).
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals), as well as a few fixes to OpenLLM v2 tasks.

  **Note:** Results have been updated after Meta modified the chat template.

  ### Accuracy

- #### Open LLM Leaderboard evaluation scores
+
  <table>
  <tr>
  <td><strong>Benchmark</strong>
@@ -151,6 +152,20 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <td><strong>Recovery</strong>
  </td>
  </tr>
+ <tr>
+ <td><strong>Arena Hard</strong>
+ </td>
+ <td>57.0 (55.8 / 58.2)
+ </td>
+ <td>57.0 (57.1 / 56.8)
+ </td>
+ <td>100.0%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>OpenLLM v1</strong>
+ </td>
+ </tr>
  <tr>
  <td>MMLU (5-shot)
  </td>
@@ -231,6 +246,104 @@ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challen
  <td><strong>99.4%</strong>
  </td>
  </tr>
+ <tr>
+ <td><strong>OpenLLM v2</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro
+ </td>
+ <td>48.12
+ </td>
+ <td>47.25
+ </td>
+ <td>98.2%
+ </td>
+ </tr>
+ <tr>
+ <td>IFEval
+ </td>
+ <td>86.41
+ </td>
+ <td>85.74
+ </td>
+ <td>99.2%
+ </td>
+ </tr>
+ <tr>
+ <td>BBH
+ </td>
+ <td>55.79
+ </td>
+ <td>55.01
+ </td>
+ <td>98.6%
+ </td>
+ </tr>
+ <tr>
+ <td>Math lvl 5
+ </td>
+ <td>26.07
+ </td>
+ <td>24.38
+ </td>
+ <td>93.5%
+ </td>
+ </tr>
+ <tr>
+ <td>GPQA ()
+ </td>
+ <td>15.40
+ </td>
+ <td>13.85
+ </td>
+ <td>89.9%
+ </td>
+ </tr>
+ <tr>
+ <td>MuSR (5-shot)
+ </td>
+ <td>18.16
+ </td>
+ <td>17.25
+ </td>
+ <td>95.0%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Average</strong>
+ </td>
+ <td><strong>41.7</strong>
+ </td>
+ <td><strong>40.6</strong>
+ </td>
+ <td><strong>97.4%</strong>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Coding</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval pass@1
+ </td>
+ <td>79.7
+ </td>
+ <td>80.5
+ </td>
+ <td>101.0%
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval+ pass@1
+ </td>
+ <td>74.8
+ </td>
+ <td>74.2
+ </td>
+ <td>99.2%
+ </td>
+ </tr>
  </table>

  ### Reproduction
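
The evaluation setup referenced in the hunks above (the llama_3.1_instruct branch of lm-evaluation-harness driving the vLLM engine) can be exercised through the harness's Python entry point. The sketch below is illustrative only: the repository id, tensor-parallel size, and the stock `mmlu` task name (rather than the fork's Meta-style prompt variants) are assumptions, and the exact commands belong to the README's Reproduction section rather than to this diff.

```python
# Minimal sketch: score the quantized checkpoint with lm-evaluation-harness on the vLLM backend.
# Assumptions for illustration: the neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 repo id,
# 4-way tensor parallelism, and the stock "mmlu" task; the fork's Meta-style task variants may differ.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16,"
        "dtype=auto,max_model_len=4096,tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["mmlu"],      # reported in the table as 5-shot
    num_fewshot=5,
    batch_size="auto",
)

# Per-task metrics come back as a nested dictionary.
print(results["results"]["mmlu"])
```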
 
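
The recovery column added throughout the table is the quantized model's score expressed as a percentage of the unquantized baseline's score. A small worked check against three of the quoted numbers; the helper below is illustrative, not code from the model card:

```python
# Recovery = quantized score / unquantized baseline score, reported as a percentage.
# The function name is illustrative; the numbers are taken from the table above.
def recovery(quantized: float, baseline: float) -> float:
    return 100.0 * quantized / baseline

print(f"Arena Hard:         {recovery(57.0, 57.0):.1f}%")  # 100.0%
print(f"OpenLLM v2 average: {recovery(40.6, 41.7):.1f}%")  # 97.4%
print(f"HumanEval pass@1:   {recovery(80.5, 79.7):.1f}%")  # 101.0%
```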