Performance gains on low-end hardware

#114
by dsprag - opened

I've been playing with this on a laptop (Intel Core Ultra 185H, 32 GB RAM, iGPU) using LM Studio, and I'm a little shocked. Typically, with 20-24B models I see speeds of 2.5-4 tok/s (excruciatingly slow), but with this model and its derivatives I'm seeing 3-4x that (still not fast, but far more usable). What would account for that performance gain?

I'm seeing similar gains (about 2x) over typical 30B models, and nearly the same performance as Qwen3-Coder-30B-A3B.

Qwen3's performance is mostly due to it being an MoE model. It's a 30B-parameter model, but only about 3B parameters are active for any given token: a router in each MoE layer sends the token to a small subset of the experts, so the rest of that layer's weights are never read. The token still passes through every layer; the savings come from skipping most of the experts within each layer. I suspect gpt-oss is an MoE model too, but I can't find anything that claims it is.
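For anyone who hasn't seen the mechanism, here's a toy sketch of top-k expert routing in plain NumPy. It's purely illustrative (the sizes, the ReLU FFN, and all the names are made up, and real models batch tokens rather than routing one at a time), but it shows why only a fraction of the weights get touched per token:

```python
# Toy sketch of top-k MoE routing -- illustrative only, not the actual
# gpt-oss or Qwen3 code; all sizes and names here are made up.
import numpy as np

d_model, d_ff = 64, 256
n_experts, top_k = 8, 2            # route each token to 2 of 8 experts

rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
            rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff))
           for _ in range(n_experts)]

def moe_layer(x):
    """One token through one MoE layer: only top_k experts' weights are read."""
    logits = x @ router_w                     # router scores, one per expert
    chosen = np.argsort(logits)[-top_k:]      # ids of the top_k experts
    e = np.exp(logits[chosen] - logits[chosen].max())
    gates = e / e.sum()                       # softmax over the chosen experts
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w_in, w_out = experts[idx]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)   # gated ReLU FFN
    return out

token = rng.normal(size=d_model)
print(moe_layer(token).shape)   # (64,) -- only 2/8 of the FFN weights were used
```

The loop at the bottom is the whole story: per token, only top_k / n_experts of the FFN weights are multiplied (and, more importantly on a bandwidth-starved iGPU, read from memory).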

It appears to be (LM Studio has a slider for experts in the model settings that only shows up for MoE models). I've done some limited testing with other MoE models since noticing the gain with gpt-oss, but haven't seen gains quite as large. I suppose it comes down to the particular MoE implementation, maybe combined with other aspects of the gpt-oss model.

README.md:

gpt-oss-20b — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

3.6B active parameters out of 21B + MXFP4 quantization optimized for this model = good speed.
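If you assume decoding is memory-bandwidth-bound (each generated token requires reading the active weights once), the back-of-envelope math lines up with the observed gain. The bandwidth number below is a guess for a Core Ultra laptop with dual-channel LPDDR5x, and the bytes-per-parameter figures are approximations, so treat this as a sanity check rather than a benchmark:

```python
# Back-of-envelope decode-speed estimate, assuming generation is
# memory-bandwidth-bound (active weights are read once per token).
# BANDWIDTH_GBPS is a rough guess for this laptop -- adjust for yours.

BANDWIDTH_GBPS = 90           # effective memory bandwidth, GB/s (assumption)
MXFP4_BYTES    = 4.25 / 8     # ~4.25 bits/param incl. block scales (approx.)
Q4_BYTES       = 4.5 / 8      # typical 4-bit GGUF quant (approx.)

def ceiling_toks(active_params_billions, bytes_per_param):
    gb_per_token = active_params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return BANDWIDTH_GBPS / gb_per_token

print(f"dense 21B @ 4-bit:           ~{ceiling_toks(21.0, Q4_BYTES):.0f} tok/s ceiling")
print(f"gpt-oss-20b (3.6B @ MXFP4):  ~{ceiling_toks(3.6, MXFP4_BYTES):.0f} tok/s ceiling")
```

Real speeds land well below these theoretical ceilings, but the ratio (roughly 6x fewer bytes touched per token) is what shows up as the 3-4x gain reported above.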
