gpt-oss is actually good, even on less common benchmarks

#109
by weijiejailbreak - opened

I’ve been experimenting with gpt-oss since its release, and contrary to many of the posts and news pieces I’ve seen, it’s surprisingly powerful, even on uncommon datasets. I tested it on our recent benchmark SATA-Bench, a benchmark where every question has at least two correct answers (a setup that is rare in standard LLM evaluation).

Results (See picture below):

The 120B open-source model performs on par with GPT-4.1 on SATA-Bench.

The 20B model lags behind but still matches DeepSeek R1 and Llama-3.1-405B.
performance.jpeg

Key takeaways:

Repetitive reasoning hurts: 11% of 20B outputs get stuck in loops, costing roughly 9 points of exact-match rate.

Reason–answer mismatches are common in the 20B model: it tends to output a single answer even when its reasoning indicates that several answers are correct.

Longer ≠ better: overthinking reduces accuracy.
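For context on the exact-match numbers above: on a select-all-that-apply question, a prediction only scores if it selects every correct option and nothing else, so emitting a single answer when two are expected counts as a miss. A minimal sketch of set-level exact match (the benchmark's actual scoring code may differ):

```python
def exact_match(predicted, gold):
    """Set-level exact match for a select-all-that-apply question:
    scores 1 only if the predicted option set equals the gold set."""
    return int(set(predicted) == set(gold))

# Picking one correct option is not enough when two are expected:
print(exact_match(["A"], ["A", "C"]))       # 0
# Order doesn't matter, only set equality:
print(exact_match(["C", "A"], ["A", "C"]))  # 1
```

This all-or-nothing criterion is why partial failure modes (like answering one option out of several) hurt the 20B model's score so much.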

Detailed findings: https://weijiexu.com/posts/sata_bench_experiments.html

SATA-Bench dataset: https://huggingface.co/datasets/sata-bench/sata-bench
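The looping failure mode mentioned above (11% of 20B outputs) can be flagged with a simple repeated-n-gram heuristic. This is a rough illustrative check I'm sketching here, not the detection method used in the linked experiments:

```python
from collections import Counter

def looks_repetitive(text, n=5, threshold=4):
    """Rough heuristic: flag text whose most frequent word n-gram
    appears at least `threshold` times, a sign of degenerate looping."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    (_, top_count), = Counter(ngrams).most_common(1)
    return top_count >= threshold

looping = "the answer is A and C because " * 10
print(looks_repetitive(looping))                          # True
print(looks_repetitive("A concise, non-repetitive answer."))  # False
```

Thresholds like `n=5` and `threshold=4` are arbitrary choices for the sketch; real filtering would need tuning against actual model outputs.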

Perhaps you're misunderstanding much of the criticism. The gpt-oss models are unusually "intelligent", but the frequent looping you mentioned, as well as the failure to output a response after thinking (especially with the 20B model), is very annoying and a deal breaker for many people.

The primary weaknesses of these models are their VERY limited broad knowledge, poor performance on creative tasks, ubiquitous and absurd refusals, and numerous blind spots in general use cases. This makes the models unusable for the general population, and hard to recommend even as a daily driver for AI enthusiasts.

In short, the gpt-oss models are not even close to being general-purpose AI models. They're overfit thinking models that are only competent in a handful of the top 100 AI use cases, and they still fall short of the proprietary models at the tasks they're competent in.

And the SATA benchmark you referenced is anything but a broad-domain test, despite what the authors claim. It's quite narrow in scope, covering just a handful of domains such as law, biomedicine, toxicity, and reading comprehension.

The ubiquitous and absurd refusals especially are what turned it into garbage for me. Totally agree!
