Latest vLLM docker image doesn't yet support gpt-oss
Just for awareness: using the latest image published by vLLM, I'm running into issues. It seems vLLM needs to update their image with a recent version of transformers that supports the model architecture. I'm placing this issue here because the deployment methods shown by Hugging Face don't currently work for containerised vLLM server deployments:
Value error, The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
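For context, that ValueError comes from transformers looking up the checkpoint's `model_type` string (from its `config.json`) in the library's architecture registry; an outdated install simply has no `gpt_oss` entry. A minimal self-contained sketch of that lookup, where `known_model_types` is a hypothetical stand-in for an old registry (not transformers' real data structure):

```python
import json

# A checkpoint's config.json declares its architecture via "model_type".
config = json.loads(
    '{"model_type": "gpt_oss", "architectures": ["GptOssForCausalLM"]}'
)

# Hypothetical stand-in for the registry of an outdated transformers build:
known_model_types = {"gpt2", "llama", "mistral"}

# An unrecognized model_type is what triggers the ValueError above.
if config["model_type"] not in known_model_types:
    print(f"unrecognized model type: {config['model_type']}")
```

Upgrading transformers (or using an image built against a newer version) adds the missing registry entry, which is why the fix has to come from the image, not the run command.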
Docker start command:
docker run --runtime nvidia --gpus all \
    --name gpt-oss \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model openai/gpt-oss-20b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
Possibly I'm too early!
Found on Docker Hub that the image to use for gpt-oss is vllm/vllm-openai:gptoss.
However, vllm/vllm-openai:gptoss hits this error:
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self._init_executor()
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model = GptOssModel(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 183, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] TransformerBlock(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.attn = Attention(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 176, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] assert self.vllm_flash_attn_version == 3, (
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(EngineCore_0 pid=387) Process EngineCore_0:
(EngineCore_0 pid=387) Traceback (most recent call last):
Looks like you are using vLLM on a GPU that doesn't support FlashAttention 3. For this model, vLLM only supports NVIDIA Hopper and Blackwell GPUs, as well as AMD MI300x and MI355x GPUs.
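To make the assertion concrete: attention sinks require FlashAttention 3, which targets NVIDIA compute capability 9.0 (Hopper) and newer. A hedged sketch of that hardware gate, where the threshold is an assumption rather than vLLM's exact dispatch logic:

```python
# Hypothetical check: does a CUDA compute capability satisfy the
# FlashAttention 3 requirement behind "Sinks are only supported in
# FlashAttention 3"? Hopper is sm90; Ampere/Ada (sm80/sm86/sm89) are older.
def supports_fa3(major: int, minor: int) -> bool:
    """Return True if (major, minor) is Hopper-class (sm90) or newer."""
    return (major, minor) >= (9, 0)

print(supports_fa3(8, 6))  # Ampere consumer GPU, e.g. RTX 3090 -> False
print(supports_fa3(9, 0))  # H100 (Hopper) -> True
```

On a real machine you could get the `(major, minor)` pair from `torch.cuda.get_device_capability(0)` before deciding which image to pull.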
Please try changing the docker image to vllm/vllm-openai:gptoss
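For reference, on a supported GPU that would be the original command from above with only the image tag swapped (flags unchanged; adjust paths and memory settings for your setup):

```shell
docker run --runtime nvidia --gpus all \
    --name gpt-oss \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
```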