BubbleQ committed
Commit bb32577 · verified · 1 Parent(s): 012a443

Update README.md

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -116,18 +116,17 @@ Note:
  | **Math** | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
  | | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
  | | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
- | **Code** | HumanEval | 86.59 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
- | | HumanEval+ | 79.27 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
- | | MBPPEvalplus | 79.9 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
- | | MBPPEvalplus++ | 68.8 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
+ | **Code** | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
+ | | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
+ | | MBPPEvalplus | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
+ | | MBPPEvalplus++ | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
  | | LiveCodeBench v5(2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
  | **Alignment** | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
  | | Multi-IF(en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
  | | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
  | | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
  | | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
- | | Average | 53.74 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |
-
+ | | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |
  Note:
  1. For InternLM3-8B-Instruct, the results marked with `*` are sourced from their official website, other evaluations are conducted based on internal evaluation frameworks.
  2. For Multi-IF, we report the overall average computed across all three rounds, pooling the Chinese and English metrics.
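Only the third score column changes in this edit: the four code-benchmark rows are re-scored and the Average row is updated to match. Below is a minimal sketch of how such an Average row could be recomputed, assuming it is a plain unweighted mean over the table's benchmarks; the full benchmark list, any weighting, and the helper names are assumptions for illustration, not the repository's actual evaluation code.

```python
# Sketch only: assumes the README's "Average" row is an unweighted mean over the
# per-benchmark scores. The full benchmark list, any weighting, and any scale
# normalization are not visible in this hunk, so everything below is illustrative.

OLD_CODE_SCORES = {"HumanEval": 74.39, "HumanEval+": 70.12,
                   "MBPPEvalplus": 82.0, "MBPPEvalplus++": 69.3}
NEW_CODE_SCORES = {"HumanEval": 78.05, "HumanEval+": 73.17,
                   "MBPPEvalplus": 83.3, "MBPPEvalplus++": 71.7}


def shift_average(old_average, old_scores, new_scores, n_benchmarks):
    """For an unweighted mean, re-scoring a few benchmarks shifts the mean by
    (sum of the score deltas) / (total number of benchmarks averaged)."""
    delta = sum(new_scores[k] - old_scores[k] for k in new_scores)
    return round(old_average + delta / n_benchmarks, 2)


# Hypothetical usage (the exact result depends on the real benchmark count,
# which this hunk does not show):
# new_average = shift_average(46.05, OLD_CODE_SCORES, NEW_CODE_SCORES, n_benchmarks=...)
```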