60.0 on AIME 2025?
#3 by youyc22 - opened
Has anyone tried this model? Is it really that good?
I tested the AWQ version of this model on AIME 2025 and it scored less than 45%. I don't think AWQ quantization would cause such a big performance drop.
youyc22 changed discussion status to closed
You can rerun our exact evaluation here: https://github.com/simplescaling/s1/blob/a465d7f429ccdcffe547cae91ec5ac3d7eb47054/eval/commands.sh#L15
Let me know if it does not reach 60%.
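For anyone comparing the two numbers in this thread, here is a rough sketch of how much a single-run score can move on a benchmark this small. It assumes the evaluation uses all 30 AIME 2025 problems (AIME I and II, 15 each) and treats each problem as an independent pass/fail trial. Both are simplifying assumptions, and `accuracy_standard_error` is a hypothetical helper, not part of the linked evaluation code. The 45% figure is the upper bound reported above, not an exact score.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n problems."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumption: the benchmark covers AIME 2025 I + II, 15 problems each.
N_PROBLEMS = 30

# Scores quoted in this thread; "AWQ rerun" is an upper bound ("less than 45%").
scores = [("reported", 0.60), ("AWQ rerun", 0.45)]

for label, acc in scores:
    se = accuracy_standard_error(acc, N_PROBLEMS)
    solved = acc * N_PROBLEMS
    print(f"{label}: {acc:.0%} is about {solved:.1f}/{N_PROBLEMS} solved, "
          f"standard error {se:.1%}")
```

Under these assumptions one standard error is roughly 9 percentage points, so the 15-point gap corresponds to about 4 or 5 problems. Rerunning the exact command with the same decoding settings is the more reliable way to tell quantization effects apart from run-to-run noise.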