alexmarques committed
Commit 3f1f2d2 · verified · 1 Parent(s): 7e47b04

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -132,9 +132,11 @@ model.save_pretrained("Meta-Llama-3.1-70B-Instruct-quantized.w4a16")
 
 ## Evaluation
 
-This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, and HumanEval benchmarks.
+This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
 In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
 Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
+The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
+We report below the scores obtained in each judgement and the average.
 OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
 This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
 HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
@@ -144,7 +146,6 @@ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](ht
 
 ### Accuracy
 
-
 <table>
 <tr>
 <td><strong>Benchmark</strong>
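The added paragraph states that each Arena-Hard answer is judged twice by GPT-4, and that the README reports the score from each judgement plus their average. A minimal sketch of that aggregation step, assuming percent win-rate scores (the function and field names are illustrative, not taken from Arena-Hard-Auto):

```python
def aggregate_arena_hard(judgement_1: float, judgement_2: float) -> dict:
    """Combine two per-judgement Arena-Hard win-rates into the reported triple.

    Both inputs are win-rates in percent; names are hypothetical, chosen to
    mirror the "score per judgement plus average" reporting described above.
    """
    return {
        "judgement_1": judgement_1,
        "judgement_2": judgement_2,
        "average": (judgement_1 + judgement_2) / 2.0,
    }

# Example with two hypothetical judgement scores
scores = aggregate_arena_hard(57.0, 58.2)
print(scores["average"])
```

The same three numbers (first judgement, second judgement, average) are what the accuracy table below reports for the Arena-Hard row.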