Can gpt-oss support local vLLM deployment on an A100 GPU?
Has anyone successfully deployed it?
At PostgresPro, we are trying to deploy the model on our A100 80GB GPU, but we're currently running into issues with vLLM and FlashAttention 3. We followed an OpenAI guide, but the model won't start. If we get it working, I'll write a guide.
Yes, you can on an H100, but not on an A100.
You can't, since the A100 doesn't support MXFP4 quantization.
They say the models, even the small one, are only supported on Ada/Hopper and newer, so the A100 is a no-go. Too bad, as we have thousands of those here.
We can't currently deploy them on our A100s or our L40Ss. There are a few GitHub issues a mile long with people hitting the same problems; hopefully the vLLM crew is working on getting older cards supported.
This is a FlashAttention thing. Try running without it, for example by forcing a different attention backend (see the sketch below).
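If you want to try that, here is a minimal sketch of steering vLLM away from FlashAttention via the `VLLM_ATTENTION_BACKEND` environment variable. The backend name `TRITON_ATTN_VLLM_V1` and the model id `openai/gpt-oss-20b` are assumptions on my side; the accepted backend strings vary by vLLM version, so check your build (or the error message) for the exact names.

```python
# Sketch: force a non-FlashAttention backend before vLLM is imported.
# Backend names differ between vLLM versions; "TRITON_ATTN_VLLM_V1" is an
# example value, not guaranteed to exist in your build.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b", gpu_memory_utilization=0.90)
out = llm.generate(["Hello from an A100"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```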
Actually, efficiency aside, attention sinks don't have to be limited to Hopper and FA3, and neither does MXFP4.
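For what it's worth, here is a minimal sketch of the sink math in plain PyTorch, assuming the common formulation where each head carries one learned sink logit that is appended to the softmax and then discarded. The names and shapes are mine, not the actual gpt-oss modules, but it shows why no Hopper-specific kernel is required for correctness:

```python
# Attention with a learned "sink" logit: the sink is just one extra column
# in the softmax that soaks up probability mass and is then dropped, so it
# runs on any GPU. (Causal masking omitted for brevity.)
import math
import torch

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: [heads, seq, head_dim]; sink_logit: [heads], one learned scalar per head
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])     # [heads, seq, seq]
    sink = sink_logit.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    probs = torch.softmax(torch.cat([sink, scores], dim=-1), dim=-1)
    return probs[..., 1:] @ v                                      # drop the sink column

h, s, d = 8, 16, 64
q, k, v = (torch.randn(h, s, d) for _ in range(3))
out = attention_with_sink(q, k, v, torch.zeros(h))
print(out.shape)  # torch.Size([8, 16, 64])
```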
Here is a temporary solution for a single A100 (80GB) to serve either the 20B or the 120B version: Tutel Instruction to Run GptOSS.
See https://github.com/vllm-project/vllm/issues/22290#issuecomment-3165645703
Single-user performance, in tokens per second (as of 2025/Aug/08):
Tutel gpt-oss (20B on 1xA100): 212 tps
vLLM gpt-oss (20B on 1xA100): 139 tps
SGLang gpt-oss (20B on 1xA100): (ongoing?)
Ollama gpt-oss (20B on 1xA100): 75 tps
Just bumping to say that, per https://github.com/vllm-project/vllm/issues/22290#issuecomment-3165645703, you can indeed now run vLLM on an A100.
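If it helps anyone, here is a quick smoke test against a locally served instance, assuming the server was started with `vllm serve openai/gpt-oss-20b` as in the linked comment and is listening on the default port 8000; the model id, port, and API key placeholder are assumptions, so adjust to your setup.

```python
# Smoke test against a local vLLM OpenAI-compatible server.
# Assumes `vllm serve openai/gpt-oss-20b` is already running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # must match the served model name
    messages=[{"role": "user", "content": "Say hi from an A100."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```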