Evaluation of AIME2024 Scores and Discrepancies with the Published Results
Dear Team,
First of all, I would like to express my sincere respect for your impressive work.
However, I have some fundamental questions regarding your publication, and I hope you don’t mind me seeking some clarification.
I conducted a benchmark evaluation on AIME2024 using the dataset “Maxwell-Jia/AIME_2024.” The details of my process are as follows:
- I did not specify any MaxLength.
- I manually verified each example by checking if the value inside the boxed answer matched the expected answer.
- I hosted both the base model and your model on a vLLM server and passed the “Problem” field from the dataset (a rough sketch of this setup is shown below).
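For reference, a minimal sketch of the evaluation loop described above. The server URL, the `Answer` field name, and the `extract_boxed` helper are my own placeholders, and the boxed-answer check here is an automated stand-in for my manual verification:

```python
# Sketch of the evaluation loop (illustrative only; BASE_URL, the "Answer"
# field name, and extract_boxed are placeholders, not from the released code).
import re
from datasets import load_dataset
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # vLLM OpenAI-compatible server
MODEL = "agentica-org/DeepScaleR-1.5B-Preview"

client = OpenAI(base_url=BASE_URL, api_key="EMPTY")
dataset = load_dataset("Maxwell-Jia/AIME_2024", split="train")

def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} in the completion (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else ""

correct = 0
for row in dataset:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": row["Problem"]}],
    )
    prediction = extract_boxed(response.choices[0].message.content)
    correct += int(prediction == str(row["Answer"]).strip())

print(f"Correct answer rate: {correct}/{len(dataset)}")
```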
The results were:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: Correct answer rate: 10/30 (33.3%)
- agentica-org/DeepScaleR-1.5B-Preview: Correct answer rate: 10/30 (33.3%)
Additionally, when I ran Stage1 (8K) training for 3 epochs using the provided code, I encountered catastrophic forgetting.
To summarize:
- The scores I obtained differ from those published, and there appears to be no performance improvement over the base model (in some cases, problems that were originally answered correctly are now answered incorrectly).
- For DeepScaleR-1.5B-Preview, I also observed catastrophic forgetting (with some calculations entering an infinite loop).
Given these outcomes, could you please clarify:
- Were any special system prompts used during benchmarking?
- Was there any specific configuration applied to the MaxLength parameter?
At this point, I am concerned about the reproducibility of the reported results and whether the published benchmarks and model status accurately reflect the underlying research and code. Could you please confirm that these benchmarks were measured as described? If possible, I would greatly appreciate any evidence or outputs from your evaluations.
I truly believe in the excellence of your work, but I have not yet been able to verify that the model possesses the advertised capabilities.
If you could share the benchmarking code or any details regarding any special methods used, it would be most helpful.
Sincerely,
Are you sure you evaluated correctly?
Try bfloat16, max length 2**15, temperature 0.6, and top_p 0.95. That should be it!
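As a rough sketch of those settings using vLLM's offline `LLM` API (illustrative only; your serving setup may differ):

```python
# Illustrative generation settings; not the exact eval script.
from vllm import LLM, SamplingParams

llm = LLM(
    model="agentica-org/DeepScaleR-1.5B-Preview",
    dtype="bfloat16",
)
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=2**15,  # 32768-token generation budget
)
outputs = llm.generate(["<AIME problem text here>"], sampling_params)
print(outputs[0].outputs[0].text)
```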
There are two separate GitHub issues from people who were able to replicate our results:
https://github.com/agentica-project/deepscaler/issues/3
https://github.com/agentica-project/deepscaler/issues/12
You should also evaluate each problem 16 times and average the results. AIME only has 30 problems and the variance of each trial is very high.
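Schematically, something like this (`generate` and `is_correct` are placeholders for your own inference call and answer check):

```python
# Rough sketch of pass@1 averaged over 16 samples per problem.
NUM_SAMPLES = 16

def avg_pass_at_1(problems, generate, is_correct):
    """Average accuracy over NUM_SAMPLES independent completions per problem."""
    per_problem = []
    for problem in problems:
        hits = sum(is_correct(problem, generate(problem)) for _ in range(NUM_SAMPLES))
        per_problem.append(hits / NUM_SAMPLES)
    return sum(per_problem) / len(per_problem)
```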
I see. So, correctness depends on probability, right?
Since I hadn’t specified the temperature or any other sampling parameters, I will set everything as suggested and try it 30 times.
Thank you.
Make sure the hyperparameters suggested above are implemented in your eval. This also matches what DeepSeek did ;)
Thank you. However, I have realized that my expectations for DeepScaleR were different from its actual performance, so I will stop here.
In mathematics, there is only one correct answer. If the claim of improved accuracy is based on the increase in the average score over multiple attempts rather than consistently providing the correct answer, then it seems we had a different understanding.
I appreciate your thoughtful responses. I will conclude my investigation on this matter here.
I think there is a fundamental misunderstanding here.
DeepScaleR's average pass@1 is ~43%. Across individual pass@1 runs it can be as low as 25% or as high as 60%. You have to take the average to smooth out the variance. All LLMs suffer from this, unfortunately, and we aren't cherry-picking the best pass@1 run.
We are following the practice recommended by DeepSeek (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B). They use a temperature of 0.6, which means sampling is not deterministic, so you need to average over multiple trials to smooth out the variance. For example, DeepSeek's official numbers are averaged over 64 trials. If you sample only once, you can get accuracy as high as 60% or as low as 25%; averaging just means that we are reporting the true performance of DeepScaleR.
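To make the variance point concrete, a back-of-the-envelope check, treating each of the 30 problems as an independent Bernoulli trial with success probability around 0.43 (a simplification):

```python
# With 30 problems and a true per-problem accuracy around 0.43, a single run's
# score swings a lot from sampling noise alone.
import math

p, n = 0.43, 30
std = math.sqrt(p * (1 - p) / n)  # ~0.09, i.e. about 9 percentage points
print(f"1-sigma range: {p - std:.2f} to {p + std:.2f}")        # roughly 0.34 to 0.52
print(f"2-sigma range: {p - 2*std:.2f} to {p + 2*std:.2f}")    # roughly 0.25 to 0.61
```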