Unable to load gpt-oss-20b on dual L40 (48GB) GPUs with vLLM
I am trying to serve gpt-oss-20b using vLLM on a server equipped with 2 × NVIDIA L40 (48GB, PCIe) GPUs. According to the documentation, the model should fit on much smaller GPUs (≈16GB VRAM required with MXFP4 weights), so 2 × 48GB should be more than enough.
However, the model fails to load properly in my environment. Some details:
- Hardware: 2 × NVIDIA L40 (48GB each, PCIe, no NVLink)
Software stack:
- CUDA 12.x, driver version [insert here]
- Python 3.12
- vLLM 0.10.1+gptoss (installed via the official wheels.vllm.ai index)
Questions:
- Has anyone successfully loaded gpt-oss-20b with vLLM on dual L40 GPUs?
- Are there known issues with L40 (Ada Lovelace, SM 8.9) and the prebuilt vLLM wheels (e.g., missing arch flags)?
- Could PCIe-only topology (no NVLink) and ACS settings cause NCCL initialization failures in this setup?
- Is there any recommended configuration or workaround to ensure stable loading?
Any guidance or confirmation from others who tried a similar setup would be greatly appreciated. Thanks!
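For reference, a rough sketch of the kind of launch I am attempting (the model name is the Hugging Face checkpoint; the flags are representative rather than my exact command, and NCCL_DEBUG is only there to surface initialization failures in the logs):

```python
# Representative launch sketch, not my exact command.
# NCCL_DEBUG=INFO is set so NCCL init failures show up in the logs.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")

from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",   # MXFP4 checkpoint from Hugging Face
    tensor_parallel_size=2,       # shard across both L40s
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```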
Load it on a single card. There is no benefit to splitting it across two cards; in fact, you will degrade performance by doing so. CUDA cores are not additive in terms of speed if your model fits within the confines of a single card - the trip over PCIe or NVLink kills any gain you'd get in most scenarios. The only time you want to stripe a model across multiple cards is when it needs multiple cards to fit all of its layers in VRAM. You can load two instances of the model, one instance on each card, and run simultaneous streams of inference, but you cannot speed up a single instance by adding more cards.

As for compatibility: anything built for Ada is going to work on any Ada-generation card. There are some exceptions with much older, pre-Ampere cards, where not all of them implement all functionality, but Ampere, Ada, and Blackwell are feature-equivalent across the line, short of datacenter configurations that add NVLink/NVSwitch.
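For the one-instance-per-card approach, a minimal sketch along these lines should work (model name and parameters are illustrative assumptions; run a second copy of the script with the other GPU index):

```python
# Hypothetical sketch: pin one vLLM instance to a single L40.
# Run another copy of this script with CUDA_VISIBLE_DEVICES="1" for the second card.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # must be set before CUDA/vLLM initialize

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # tensor_parallel_size defaults to 1
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```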
Are you able to run other models outside of this one? When you installed vLLM, were your environment variables set correctly to expose CUDA? What is the actual error message you are receiving? What have you done to troubleshoot?
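As a quick first check, something like this confirms that PyTorch/CUDA can actually see both L40s before involving vLLM at all (a plain sanity-check sketch, nothing vLLM-specific):

```python
# Verify that CUDA is exposed and both GPUs are visible to PyTorch.
import torch

print(torch.cuda.is_available())              # should be True
print(torch.cuda.device_count())              # should be 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))   # expect "NVIDIA L40" entries
```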
I do not believe Ada is supported yet (see https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html), but I hope I am wrong.
It feels like vLLM itself is buggy; the 20b model will not start for me even on an A100 Tesla. I am wondering whether OpenAI were joking when they said it can be deployed on a 16GB card, or what?