vLLM FlashAttention3 with A6000

#33
by YieumYoon - opened

Does FlashAttention 3 work on an A6000? I tried to run it on 2x A6000 GPUs but am facing the error below.

(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     assert self.vllm_flash_attn_version == 3, (
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559] AssertionError: Sinks are only supported in FlashAttention 3
podman run --device nvidia.com/gpu=2 --device nvidia.com/gpu=3 -p 8000:8000 --ipc=host vllm/vllm-openai:gptoss --model openai/gpt-oss-120b --tensor-parallel-size 2
INFO 08-05 15:09:09 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 08-05 15:09:12 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250805
(APIServer pid=1) INFO 08-05 15:09:12 [utils.py:326] non-default args: {'model': 'openai/gpt-oss-120b', 'tensor_parallel_size': 2}
(APIServer pid=1) INFO 08-05 15:09:20 [config.py:726] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:00<00:00, 129.79it/s]
(APIServer pid=1) INFO 08-05 15:09:21 [config.py:1759] Using max model len 131072
(APIServer pid=1) WARNING 08-05 15:09:22 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1) INFO 08-05 15:09:23 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 08-05 15:09:23 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-05 15:09:27 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=435) INFO 08-05 15:09:30 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=435) INFO 08-05 15:09:30 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250805) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-120b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=435) 
(EngineCore_0 pid=435)              LL          LL          MMM       MMM 
(EngineCore_0 pid=435)              LL          LL          MMMM     MMMM
(EngineCore_0 pid=435)          V   LL          LL          MM MM   MM MM
(EngineCore_0 pid=435) vvvv  VVVV   LL          LL          MM  MM MM  MM
(EngineCore_0 pid=435) vvvv VVVV    LL          LL          MM   MMM   MM
(EngineCore_0 pid=435)  vvv VVVV    LL          LL          MM    M    MM
(EngineCore_0 pid=435)   vvVVVV     LL          LL          MM         MM
(EngineCore_0 pid=435)     VVVV     LLLLLLLLLL  LLLLLLLLL   M           M
(EngineCore_0 pid=435) 
(EngineCore_0 pid=435) WARNING 08-05 15:09:30 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=435) INFO 08-05 15:09:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_e24ce977'), local_subscribe_addr='ipc:///tmp/8ba555d2-6503-42cb-b95c-4d67ce837260', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-05 15:09:33 [__init__.py:241] Automatically detected platform cuda.
INFO 08-05 15:09:33 [__init__.py:241] Automatically detected platform cuda.
W0805 15:09:36.485000 569 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0805 15:09:36.485000 569 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0805 15:09:36.485000 570 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0805 15:09:36.485000 570 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
(VllmWorker TP1 pid=570) INFO 08-05 15:09:42 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f0144421'), local_subscribe_addr='ipc:///tmp/cfe42932-5875-4a0a-95d0-f9800953c691', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=569) INFO 08-05 15:09:42 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3d3f0683'), local_subscribe_addr='ipc:///tmp/7d660153-4734-47be-ab73-b58bbd15b5b4', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W805 15:09:44.117136145 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W805 15:09:44.126643786 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(VllmWorker TP1 pid=570) INFO 08-05 15:09:44 [__init__.py:1381] Found nccl from library libnccl.so.2
(VllmWorker TP0 pid=569) INFO 08-05 15:09:44 [__init__.py:1381] Found nccl from library libnccl.so.2
(VllmWorker TP1 pid=570) INFO 08-05 15:09:44 [pynccl.py:70] vLLM is using nccl==2.27.5
(VllmWorker TP0 pid=569) INFO 08-05 15:09:44 [pynccl.py:70] vLLM is using nccl==2.27.5
(VllmWorker TP1 pid=570) INFO 08-05 15:09:45 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_6316cb31'), local_subscribe_addr='ipc:///tmp/a658c459-2b17-49ab-8f05-077e53b8a57a', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [parallel_state.py:1102] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP1 pid=570) INFO 08-05 15:09:45 [parallel_state.py:1102] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=570) INFO 08-05 15:09:45 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-120b...
(VllmWorker TP1 pid=570) INFO 08-05 15:09:45 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-120b...
(VllmWorker TP0 pid=569) INFO 08-05 15:09:45 [gpu_model_runner.py:1945] Loading model from scratch...
(VllmWorker TP1 pid=570) INFO 08-05 15:09:45 [gpu_model_runner.py:1945] Loading model from scratch...
(VllmWorker TP0 pid=569) INFO 08-05 15:09:46 [cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP1 pid=570) INFO 08-05 15:09:46 [cuda.py:323] Using Flash Attention backend on V1 engine.
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559] WorkerProc failed to start.
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559] Traceback (most recent call last):
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     worker = WorkerProc(*args, **kwargs)
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.worker.load_model()
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model = model_loader.load_model(
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model = GptOssModel(
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                  ^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 183, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     TransformerBlock(
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.attn = Attention(
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 176, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     assert self.vllm_flash_attn_version == 3, (
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=570) ERROR 08-05 15:09:46 [multiproc_executor.py:559] AssertionError: Sinks are only supported in FlashAttention 3
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559] WorkerProc failed to start.
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559] Traceback (most recent call last):
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     worker = WorkerProc(*args, **kwargs)
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.worker.load_model()
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model = model_loader.load_model(
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 44, in load_model
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     model = initialize_model(vllm_config=vllm_config,
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 241, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.model = GptOssModel(
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                  ^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 183, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 214, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     TransformerBlock(
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 183, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.attn = OAIAttention(config, prefix=f"{prefix}.attn")
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 110, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.attn = Attention(
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 176, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 417, in __init__
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]     assert self.vllm_flash_attn_version == 3, (
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=569) ERROR 08-05 15:09:46 [multiproc_executor.py:559] AssertionError: Sinks are only supported in FlashAttention 3
(VllmWorker TP0 pid=569) INFO 08-05 15:09:46 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP1 pid=570) INFO 08-05 15:09:46 [multiproc_executor.py:520] Parent process exited, terminating worker
[rank0]:[W805 15:09:47.512523887 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     self._init_executor()
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718]     raise e from None
(EngineCore_0 pid=435) ERROR 08-05 15:09:48 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_0 pid=435) Process EngineCore_0:
(EngineCore_0 pid=435) Traceback (most recent call last):
(EngineCore_0 pid=435)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=435)     self.run()
(EngineCore_0 pid=435)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=435)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=435)     raise e
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=435)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=435)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=435)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=435)     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=435)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=435)     self._init_executor()
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor
(EngineCore_0 pid=435)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_0 pid=435)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=435)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready
(EngineCore_0 pid=435)     raise e from None
(EngineCore_0 pid=435) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1895, in <module>
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=1)     return fn(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

same error

@YieumYoon FlashAttention-3 is designed and optimized specifically for NVIDIA Hopper GPUs, such as the H100, and currently lacks support for earlier architectures.

Ref: https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release
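
If you want to double-check what your cards report, a quick PyTorch one-liner (just a diagnostic sketch; FlashAttention 3 targets Hopper, compute capability 9.0, while the RTX A6000 reports 8.6):

python -c "import torch; print([(torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i)) for i in range(torch.cuda.device_count())])"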

Looks like they've just fixed it (6 minutes ago)

https://github.com/vllm-project/vllm/commits/main/

Sinks should work without flash-attn-3 if you build main.

So, do we simply have to reinstall "flash-attn" and retry, and then it'll work?

Same error, tested on RTX A6000 and RTX 6000 Ada

Stay tuned for an update in: https://hub.docker.com/r/vllm/vllm-openai/tags?name=gptoss

same error with L40

Looks like they've just fixed it (6 minutes ago)

https://github.com/vllm-project/vllm/commits/main/

Sinks should work without flash-attn-3 if you build main.

I tried to install vLLM directly from the git repository with this command:

uv pip install git+https://github.com/vllm-project/vllm.git \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

Updated https://github.com/vllm-project/vllm.git (9edd1db02bc6dce6da503503a373657f3466a78b)
Resolved 145 packages in 12.60s
Built vllm @ git+https://github.com/vllm-project/vllm.git@9edd1db02bc6dce6da503503a373657f3466a78b
Prepared 1 package in 8m 47s
Uninstalled 1 package in 23ms
Installed 1 package in 18ms
 - vllm==0.10.1.dev387+g6e2092435 (from git+https://github.com/vllm-project/vllm.git@6e20924350e3fed375bc63d55166a303b6f0828a)
 + vllm==0.10.1.dev398+g9edd1db02 (from git+https://github.com/vllm-project/vllm.git@9edd1db02bc6dce6da503503a373657f3466a78b)

but I still get a 'Value error, Unknown quantization method: mxfp4' when running this command:

CUDA_VISIBLE_DEVICES=2,3 vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
(APIServer pid=1094799)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1094799) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1094799)   Value error, Unknown quantization method: mxfp4. Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'ptpc_fp8', 'fbgemm_fp8', 'modelopt', 'modelopt_fp4', 'marlin', 'bitblas', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'gptq_bitblas', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'hqq', 'experts_int8', 'neuron_quant', 'ipex', 'quark', 'moe_wna16', 'torchao', 'auto-round', 'rtn', 'inc']. [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
(APIServer pid=1094799)     For further information visit https://errors.pydantic.dev/2.12/v/value_error
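
One thing worth ruling out (a hedged guess, not confirmed in this thread): that the vllm command is actually picking up the freshly installed git build rather than an older install elsewhere on the path. A quick check:

python -c "import vllm; print(vllm.__version__, vllm.__file__)"

If that prints something older than the 0.10.1.dev398+g9edd1db02 build installed above, the serve command is running against a different environment.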

Facing the same issue, not able to deploy gpt-oss on an A10G GPU. Is the issue solved?

I tried to install vLLM directly from the git repository with this command:

Sorry guys, I didn't get a chance to test it / have been using llama.cpp. Just saw the PR

I tried to install vLLM directly from the git repository with this command:

Sorry guys, I didn't get a chance to test it / have been using llama.cpp. Just saw the PR

Hey, can you test it out? I tried deploying the model but am facing: AssertionError: Sinks are only supported in FlashAttention 3

Mind setting this env var: VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
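
For a containerized setup like the one in the original post, that would mean passing the variable into the container, roughly like this (a sketch based on the podman command above, untested here):

podman run --device nvidia.com/gpu=2 --device nvidia.com/gpu=3 -p 8000:8000 --ipc=host \
    -e VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
    vllm/vllm-openai:gptoss --model openai/gpt-oss-120b --tensor-parallel-size 2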

Mind setting this env var: VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1

Is this supposed to be the solution?

Is this supposed to be the solution?

command

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name openai/gpt-oss-20b --host 0.0.0.0

works for me on RTX 3090ti
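
Once the server is up, it can be sanity-checked with a standard OpenAI-style request against vLLM's /v1/chat/completions endpoint (adjust host, port, and model name to your setup):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'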

Is this supposed to be the solution?

command

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name openai/gpt-oss-20b --host 0.0.0.0

works for me on RTX 3090ti

Don't know what's wrong, but I can't get the 20b to work with vLLM on my RTX 5090. Even using the latest Docker image, which supports Ampere, it always errors out.

I start it with:

sudo docker run \
    --gpus device=1 \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --name vllmgpt \
    -p 5678:8000 \
    --ipc=host \
    -e VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
    -e VLLM_USE_TRTLLM_ATTENTION=1 \
    -e VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
    -e VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
    -e VLLM_USE_FLASHINFER_MXFP4_MOE=1 \
    vllm/vllm-openai:gptoss \
    --model openai/gpt-oss-20b \
    --async-scheduling

and that's the log:

INFO 08-12 14:36:59 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 08-12 14:37:01 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=1) INFO 08-12 14:37:01 [utils.py:326] non-default args: {'model': 'openai/gpt-oss-20b', 'async_scheduling': True}
(APIServer pid=1) INFO 08-12 14:37:09 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=1) INFO 08-12 14:37:10 [config.py:1759] Using max model len 131072
(APIServer pid=1) WARNING 08-12 14:37:12 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1) INFO 08-12 14:37:13 [arg_utils.py:1188] Using mp-based distributed executor backend for async scheduling.
(APIServer pid=1) INFO 08-12 14:37:13 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 08-12 14:37:13 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-12 14:37:18 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=114) 
(EngineCore_0 pid=114)              LL          LL          MMM       MMM 
(EngineCore_0 pid=114)              LL          LL          MMMM     MMMM
(EngineCore_0 pid=114)          V   LL          LL          MM MM   MM MM
(EngineCore_0 pid=114) vvvv  VVVV   LL          LL          MM  MM MM  MM
(EngineCore_0 pid=114) vvvv VVVV    LL          LL          MM   MMM   MM
(EngineCore_0 pid=114)  vvv VVVV    LL          LL          MM    M    MM
(EngineCore_0 pid=114)   vvVVVV     LL          LL          MM         MM
(EngineCore_0 pid=114)     VVVV     LLLLLLLLLL  LLLLLLLLL   M           M
(EngineCore_0 pid=114) 
(EngineCore_0 pid=114) WARNING 08-12 14:37:21 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_06e2e7a6'), local_subscribe_addr='ipc:///tmp/5e7288cc-bd69-4452-95a9-13abc615ea70', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-12 14:37:24 [__init__.py:241] Automatically detected platform cuda.
(VllmWorker pid=168) INFO 08-12 14:37:27 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1ccdd475'), local_subscribe_addr='ipc:///tmp/9c1253c1-33f9-4b39-8f67-f99e7dbc5f86', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(VllmWorker pid=168) INFO 08-12 14:37:28 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker pid=168) INFO 08-12 14:37:28 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker pid=168) INFO 08-12 14:37:28 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-20b...
(VllmWorker pid=168) INFO 08-12 14:37:28 [gpu_model_runner.py:1945] Loading model from scratch...
(VllmWorker pid=168) INFO 08-12 14:37:28 [cuda.py:286] Using Triton backend on V1 engine.
(VllmWorker pid=168) WARNING 08-12 14:37:28 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(VllmWorker pid=168) WARNING 08-12 14:37:28 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(VllmWorker pid=168) INFO 08-12 14:37:28 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
(VllmWorker pid=168) INFO 08-12 14:37:29 [weight_utils.py:296] Using model weights format ['*.safetensors']
(VllmWorker pid=168) INFO 08-12 14:37:32 [default_loader.py:262] Loading weights took 3.18 seconds
(VllmWorker pid=168) INFO 08-12 14:37:32 [mxfp4.py:176] Shuffling MoE weights, it might take a while...
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] WorkerProc failed to start.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Traceback (most recent call last):
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     worker = WorkerProc(*args, **kwargs)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in __init__
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     self.worker.load_model()
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     self.model = model_loader.load_model(
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     process_weights_after_loading(model, model_config, target_device)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     quant_method.process_weights_after_loading(module)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 260, in process_weights_after_loading
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     shuffle_matrix_a(w13_bias[i].clone().reshape(-1, 1),
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]     ^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Search for `cudaErrorNoKernelImageForDevice` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]
(VllmWorker pid=168) INFO 08-12 14:37:42 [multiproc_executor.py:520] Parent process exited, terminating worker
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     self._init_executor()
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718]     raise e from None
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
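
"no kernel image is available for execution on the device" usually means the CUDA kernels in the image were not built for the card's compute capability (the RTX 5090 is a Blackwell-generation card). A quick diagnostic to run inside the container, as a sketch using plain PyTorch (note the failing kernel here comes from the mxfp4 MoE path, so the image's prebuilt vLLM/FlashInfer kernels matter as well):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"

If the device's sm_XX is missing from the printed arch list, the bundled PyTorch simply does not ship kernels for that GPU.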

ESC[1;36m(VllmWorker pid=168)ESC[0;0m ERROR 08-12 14:37:42 [multiproc_executor.py:559]
ESC[1;36m(VllmWorker pid=168)ESC[0;0m INFO 08-12 14:37:42 [multiproc_executor.py:520] Parent process exited, terminating worker
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] EngineCore failed to start.
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] Traceback (most recent call last):
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in init
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] super().init(vllm_config, executor_class, log_stats,
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in init
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] self.model_executor = executor_class(vllm_config)
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in init
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] self._init_executor()
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] raise e from None
ESC[1;36m(EngineCore_0 pid=114)ESC[0;0m ERROR 08-12 14:37:44 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.

I get the exact same error :-(

Is this supposed to be the solution?

The command

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name openai/gpt-oss-20b --host 0.0.0.0

works for me on an RTX 3090 Ti.
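
If you want to confirm the override actually takes effect, the chosen backend is printed during engine startup (the "Using Triton backend on V1 engine" line visible in the log further down). A rough, untested sketch of checking that from the shell, assuming the server output is captured to a local vllm.log file (the file name is just an example):

# capture the startup log while serving (untested sketch)
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name openai/gpt-oss-20b --host 0.0.0.0 2>&1 | tee vllm.log
# in a second shell, once the engine has initialized:
grep "backend on V1 engine" vllm.log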

Don't know what's wrong, but I can't get the 20b to work with vLLM on my RTX 5090. Even using the latest Docker image, which supports Ampere, it always errors out.

I start it with:

sudo docker run \
  --gpus device=1 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --name vllmgpt \
  -p 5678:8000 \
  --ipc=host \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 \
  -e VLLM_USE_TRTLLM_ATTENTION=1 \
  -e VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
  -e VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
  -e VLLM_USE_FLASHINFER_MXFP4_MOE=1 \
  vllm/vllm-openai:gptoss \
  --model openai/gpt-oss-20b \
  --async-scheduling

and that's the log:

INFO 08-12 14:36:59 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 08-12 14:37:01 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=1) INFO 08-12 14:37:01 [utils.py:326] non-default args: {'model': 'openai/gpt-oss-20b', 'async_scheduling': True}
(APIServer pid=1) INFO 08-12 14:37:09 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=1) INFO 08-12 14:37:10 [config.py:1759] Using max model len 131072
(APIServer pid=1) WARNING 08-12 14:37:12 [config.py:1198] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=1) INFO 08-12 14:37:13 [arg_utils.py:1188] Using mp-based distributed executor backend for async scheduling.
(APIServer pid=1) INFO 08-12 14:37:13 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 08-12 14:37:13 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
INFO 08-12 14:37:18 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-20b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=114) [vLLM ASCII-art banner]
(EngineCore_0 pid=114) WARNING 08-12 14:37:21 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=114) INFO 08-12 14:37:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_06e2e7a6'), local_subscribe_addr='ipc:///tmp/5e7288cc-bd69-4452-95a9-13abc615ea70', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-12 14:37:24 [__init__.py:241] Automatically detected platform cuda.
(VllmWorker pid=168) INFO 08-12 14:37:27 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1ccdd475'), local_subscribe_addr='ipc:///tmp/9c1253c1-33f9-4b39-8f67-f99e7dbc5f86', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(VllmWorker pid=168) INFO 08-12 14:37:28 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker pid=168) INFO 08-12 14:37:28 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker pid=168) INFO 08-12 14:37:28 [gpu_model_runner.py:1913] Starting to load model openai/gpt-oss-20b...
(VllmWorker pid=168) INFO 08-12 14:37:28 [gpu_model_runner.py:1945] Loading model from scratch...
(VllmWorker pid=168) INFO 08-12 14:37:28 [cuda.py:286] Using Triton backend on V1 engine.
(VllmWorker pid=168) WARNING 08-12 14:37:28 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(VllmWorker pid=168) WARNING 08-12 14:37:28 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(VllmWorker pid=168) INFO 08-12 14:37:28 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
(VllmWorker pid=168) INFO 08-12 14:37:29 [weight_utils.py:296] Using model weights format ['*.safetensors']
(VllmWorker pid=168) INFO 08-12 14:37:32 [default_loader.py:262] Loading weights took 3.18 seconds
(VllmWorker pid=168) INFO 08-12 14:37:32 [mxfp4.py:176] Shuffling MoE weights, it might take a while...
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] WorkerProc failed to start.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Traceback (most recent call last):
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 533, in worker_main
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] worker = WorkerProc(*args, **kwargs)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 402, in __init__
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] self.worker.load_model()
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] self.model = model_loader.load_model(
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] process_weights_after_loading(model, model_config, target_device)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] quant_method.process_weights_after_loading(module)
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/mxfp4.py", line 260, in process_weights_after_loading
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] shuffle_matrix_a(w13_bias[i].clone().reshape(-1, 1),
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] ^^^^^^^^^^^^^^^^^^^
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker pid=168) ERROR 08-12 14:37:42 [multiproc_executor.py:559]
(VllmWorker pid=168) INFO 08-12 14:37:42 [multiproc_executor.py:520] Parent process exited, terminating worker
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] self._init_executor()
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 96, in _init_executor
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 472, in wait_for_ready
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] raise e from None
(EngineCore_0 pid=114) ERROR 08-12 14:37:44 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
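
Note that this trace never reaches the attention backend: it fails while shuffling the MXFP4 MoE weights with "no kernel image is available for execution on the device", which usually means the CUDA kernels in the image were not compiled for the GPU's compute capability (sm_120 on an RTX 5090). One rough way to check what the bundled PyTorch build ships with is the untested sketch below; torch.cuda.get_device_capability() and torch.cuda.get_arch_list() are standard PyTorch calls, while the --entrypoint override against the vllm/vllm-openai:gptoss image is just an assumption about how to get a Python shell inside that container:

# Untested sketch: print the GPU's compute capability and the kernel
# architectures the bundled PyTorch build was compiled for.
sudo docker run --rm --gpus device=1 --entrypoint python3 vllm/vllm-openai:gptoss \
  -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"
# If the reported capability (e.g. (12, 0) on an RTX 5090) has no matching sm_ entry
# in the arch list, kernel launches will fail exactly like the trace above.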

Any news on this? I tested a week ago with the same error (RTX 5090 + vLLM). Has anyone got it up and running on Blackwell?
