Latest vLLM docker image doesn't yet support gpt-oss
Just for awareness: using the latest image published by vLLM, I'm running into issues. It seems vLLM needs to update their image with a recent version of transformers that supports the model architecture. I'm placing this issue here because the deployment methods shown by Hugging Face don't currently work for containerised vLLM server deployments:
Value error, The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
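For context, that ValueError comes from transformers looking up the checkpoint's `model_type` string (from its `config.json`) in the library's architecture registry; an outdated install simply has no `gpt_oss` entry. A minimal self-contained sketch of that lookup, where `known_model_types` is a hypothetical stand-in for an old registry (not transformers' real data structure):

```python
import json

# A checkpoint's config.json declares its architecture via "model_type".
config = json.loads(
    '{"model_type": "gpt_oss", "architectures": ["GptOssForCausalLM"]}'
)

# Hypothetical stand-in for the registry of an outdated transformers build:
known_model_types = {"gpt2", "llama", "mistral"}

# An unrecognized model_type is what triggers the ValueError above.
if config["model_type"] not in known_model_types:
    print(f"unrecognized model type: {config['model_type']}")
```

Upgrading transformers (or using an image built against a newer version) adds the missing registry entry, which is why the fix has to come from the image, not the run command.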
Docker start command:
docker run --runtime nvidia --gpus all \
    --name gpt-oss \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model openai/gpt-oss-20b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
Possibly I'm too early!
Found on Docker Hub that the image to use for gpt-oss is vllm/vllm-openai:gptoss.
However, vllm/vllm-openai:gptoss hits this error:
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=387) INFO 08-06 00:56:58 [cuda.py:323] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self._init_executor()
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] model = initialize_model(vllm_config=vllm_config,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.model = GptOssModel(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 183, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] TransformerBlock(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.attn = Attention(
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 176, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] assert self.vllm_flash_attn_version == 3, (
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=387) ERROR 08-06 00:56:58 [core.py:718] AssertionError: Sinks are only supported in FlashAttention 3
(EngineCore_0 pid=387) Process EngineCore_0:
(EngineCore_0 pid=387) Traceback (most recent call last):
Looks like you are using vLLM on a GPU that doesn't support FlashAttention 3. For this model, vLLM only supports NVIDIA Hopper and Blackwell GPUs, as well as AMD MI300x and MI355x GPUs.
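To make the assertion concrete: attention sinks require FlashAttention 3, which targets NVIDIA compute capability 9.0 (Hopper) and newer. A hedged sketch of that hardware gate, where the threshold is an assumption rather than vLLM's exact dispatch logic:

```python
# Hypothetical check: does a CUDA compute capability satisfy the
# FlashAttention 3 requirement behind "Sinks are only supported in
# FlashAttention 3"? Hopper is sm90; Ampere/Ada (sm80/sm86/sm89) are older.
def supports_fa3(major: int, minor: int) -> bool:
    """Return True if (major, minor) is Hopper-class (sm90) or newer."""
    return (major, minor) >= (9, 0)

print(supports_fa3(8, 6))  # Ampere consumer GPU, e.g. RTX 3090 -> False
print(supports_fa3(9, 0))  # H100 (Hopper) -> True
```

On a real machine you could get the `(major, minor)` pair from `torch.cuda.get_device_capability(0)` before deciding which image to pull.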
Please try changing the docker image to vllm/vllm-openai:gptoss
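For reference, on a supported GPU that would be the original command from above with only the image tag swapped (flags unchanged; adjust paths and memory settings for your setup):

```shell
docker run --runtime nvidia --gpus all \
    --name gpt-oss \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
```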