AIME2024 has 30 Tests - Can't score 80.96

#36
by fblgit - opened

Hi there... I'm not sure how this AIME2024 evaluation was done, but AIME2024 has 15 + 15 questions (AIME I and AIME II, 30 in total).
R1 scored 79.8, meaning roughly 24 out of 30 questions were answered correctly (24/30 = 80%).
If this model outperformed R1, that means it answered 25 questions correctly, scoring 83.3 (25/30)..
What kind of AIME2024 setup produces a score that corresponds to a fraction of a question?
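The arithmetic behind the complaint can be sketched quickly: with a single pass over 30 questions, every attainable percentage is a multiple of 100/30, so 80.96 cannot come from one run. A minimal check (the numbers are just the grid of possible single-run scores, nothing model-specific):

```python
# Every single-run AIME2024 score is k/30 for some integer k,
# so the attainable percentages form a coarse grid.
single_run_scores = [round(100 * k / 30, 2) for k in range(31)]

print(single_run_scores[24])           # 24 correct -> 80.0
print(single_run_scores[25])           # 25 correct -> 83.33
print(80.96 in single_run_scores)      # False: unreachable in one run
```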

I refer to:

It also feels very strange to improve the result on this test while regressing on the majority of the known benchmarks..

[Screenshot: Screenshot 2025-02-19 at 9.01.48 AM.png]

What are you talking about 😂? Clearly this is not 79.8; maybe there's something off in your understanding somewhere.

How many right answers did this model get on AIME to score 80.96?
Because R1 got 24 right answers for its benchmark score.

@fblgit I think the decimal precision comes from averaging multiple runs
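That averaging explanation can be illustrated with a small sketch: if accuracy is averaged over several sampled runs (the per-run correct counts below are hypothetical, just to show the mechanism), the result lands between the 1/30 grid points:

```python
# Hypothetical per-run correct counts out of 30, e.g. from sampling
# the model several times and averaging the per-run accuracies.
runs = [25, 24, 24, 24, 24]

avg = 100 * sum(runs) / (30 * len(runs))
print(round(avg, 2))  # 80.67 -- a fraction of a question, as observed
```

So a reported 80.96 is consistent with averaging over many runs (or many samples per question), even though no single pass over 30 questions can produce it.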

@eugenhotaj-ppl Could anyone on your team address this massive issue with the closed-source dataset and classifier? The whole community would be grateful if you decided to make it publicly available, as this could serve those who need it most.
You can still change the trajectory, but the time window is shrinking. If you need more time to evaluate your move, you can also freeze everything, remove the model, and apologize for all the trouble it caused. Then take the necessary time to assess the situation, taking into consideration all the feedback you receive.

Feedback is a gift people offer out of love, and it's precious. I hope you can learn this the easy way.
Beware: recovering from betrayal is hard and doesn't always happen.
