60.0 on AIME 2025?

#3
by youyc22 - opened

Has anyone tried this model? Is it really that good?

I tested the AWQ version of this model on AIME 2025 and it scored less than 45%. I don't think AWQ quantization alone would cause such a big performance drop.

youyc22 changed discussion status to closed
simplescaling org

You can rerun our exact evaluation here: https://github.com/simplescaling/s1/blob/a465d7f429ccdcffe547cae91ec5ac3d7eb47054/eval/commands.sh#L15. Let me know if it does not reproduce the 60%.
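
When comparing a rerun against the reported score, it may help to translate the percentage gap into a number of problems. The following is a minimal sketch, not part of the linked evaluation script; the function names are hypothetical and the benchmark sizes (15 and 30) are assumptions, so check how many problems the evaluated split actually contains.

```python
def problems_correct(score_pct: float, n_problems: int) -> float:
    """Convert a percentage score into the number of problems answered correctly."""
    return score_pct * n_problems / 100.0


def gap_in_problems(score_a_pct: float, score_b_pct: float, n_problems: int) -> float:
    """Difference between two percentage scores, expressed in problems."""
    return problems_correct(score_a_pct, n_problems) - problems_correct(score_b_pct, n_problems)


if __name__ == "__main__":
    # Assumed benchmark sizes; the real size depends on the split that was evaluated.
    for n in (15, 30):
        gap = gap_in_problems(60.0, 45.0, n)
        print(f"n={n}: 60% vs 45% differs by about {gap:.2f} problems")
```

On a benchmark this small, each problem is worth several percentage points, so differences in sampling settings or answer extraction can move the score noticeably, independent of quantization.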
