[v1 engine][flash_attn backend] TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux' when running gpt-oss-120b on H200
vLLM v1 engine crashes at runtime when using the FlashAttention backend. The kernel call passes s_aux, but the installed flash-attn doesn’t accept this keyword, causing:
TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux'
Environment
OS: (e.g., Ubuntu 22.04)
Python: 3.12
vLLM: v0.10.2.dev2+gf5635d62e.d20250805 (from log)
PyTorch: 2.9.0.dev20250804+cu128
CUDA: 12.8
GPUs: 2x H200
Install method:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
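For reference, a quick listing of what actually ended up in the environment can rule out a stale or partially upgraded install (a minimal sketch; the grep pattern just covers the packages mentioned in this report):
uv pip list | grep -Ei "vllm|flash|torch|harmony"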
Command to Reproduce:
export VLLM_ATTENTION_BACKEND=flash-attn # default or implied
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --tensor-parallel-size 2
Send any chat/completions request; crash happens on first step
Key Logs: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux'
at vllm/v1/attention/backends/flash_attn.py:526
...
torch._ops.vllm.unified_attention_with_output(...)
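A quick way to confirm the mismatch is to inspect the signature of the installed kernel. This is only a diagnostic sketch: it assumes flash_attn_varlen_func is importable from the flash_attn package, while the vLLM v1 backend may instead call its vendored copy (vllm.vllm_flash_attn), so adjust the import to match the stack trace:
python -c "from flash_attn import flash_attn_varlen_func as f; import inspect; print('s_aux' in inspect.signature(f).parameters)"
If this prints False, the installed kernel predates the s_aux argument that the backend passes.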
What I tried
- Switching to the torch-sdpa backend works as a workaround:
  export VLLM_ATTENTION_BACKEND=torch-sdpa
- Rebuilding flash-attn (current PyTorch is nightly 2.9 + CUDA 12.8, so no prebuilt wheel is available).
- Uninstalling flash-attn/flashinfer to avoid version conflicts.
Apart from the SDPA workaround, none of these steps resolved the underlying issue. I'd appreciate any guidance on resolving this.
I have the same issue. Why did that happen?
I have the same issue (with 2x H100 GPUs)
(EngineCore_0 pid=) ERROR 08-06 00:32:38 [core.py:720] RuntimeError: Worker failed with error 'flash_attn_varlen_func() got an unexpected keyword argument 's_aux'', please check the stack trace above for the root cause
...
(APIServer pid=) ERROR 08-06 00:32:38 [serving_chat.py:1001] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Same for me
Encountered the same issue when specifying "flash_attention_2" for the attn_implementation argument. The issue was resolved by either omitting the argument or explicitly setting it to "sdpa".
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
torch_dtype="auto",
attn_implementation="sdpa",
)
messages = [
{"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1] :]))
Investigation
I found a difference in the vllm and openai-harmony wheels compared with the Docker container, which works well locally.
Here are the md5sums of the wheels provided at https://wheels.vllm.ai/gpt-oss/, the extra index used in the gpt-oss guide:
a9c90e016ef0f04e963944f48c13da86 flashinfer_python-0.2.8-py3-none-any.whl
2be184d07d76ea6a1ad41e069de13d8f gpt_oss-0.1.0-py3-none-any.whl
5f2093088f8a5c46fa010000dbde807b triton-3.4.0+git663e04e8-cp38-abi3-linux_x86_64.whl
5efee9b5fa4409318695c7e9dc155282 triton_kernels-1.0.0-py3-none-any.whl
d790a1cbde026100db2d5626667454ce vllm-0.10.1+gptoss-cp38-abi3-linux_x86_64.whl
However, the wheels inside the working container are as follows:
a9c90e016ef0f04e963944f48c13da86 flashinfer_python-0.2.8-py3-none-any.whl
2be184d07d76ea6a1ad41e069de13d8f gpt_oss-0.1.0-py3-none-any.whl
27d4a3918f6af4550bdb5f66bf68ae39 openai_harmony-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
5f2093088f8a5c46fa010000dbde807b triton-3.4.0+git663e04e8-cp38-abi3-linux_x86_64.whl
5efee9b5fa4409318695c7e9dc155282 triton_kernels-1.0.0-py3-none-any.whl
47d8963ec500eb6793191f1d3b3e66b4 vllm-0.10.1+gptoss-cp38-abi3-linux_x86_64.whl
Apparently the vllm wheel in the container has a different md5sum even though the filename is the same, and the container also ships a newer openai_harmony than the upstream PyPI index.
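To reproduce the upstream side of this comparison, something like the following should work (a sketch; --no-deps avoids pulling the torch nightly, and the output directory name is arbitrary):
pip download --no-deps --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ -d upstream-wheels
md5sum upstream-wheels/*.whl
Then diff that against the checksums copied out of the container.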
Workaround
- Download the Docker container image from Docker Hub.
- Copy all the wheels under /vllm-workspace out to a local directory (e.g. with docker cp), as sketched below.
- Install those wheels instead of the upstream ones, e.g.
  pip install *.whl --extra-index-url https://download.pytorch.org/whl/nightly/cu128
You may need the --force-reinstall flag if you have already installed from upstream, because the vllm package name is the same as the upstream one.
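A sketch of those steps as shell commands; the image name and tag below are placeholders (use the gpt-oss image referenced in the guide), and the wheel location inside the container may differ:
docker pull vllm/vllm-openai:gptoss                            # placeholder image:tag
docker create --name gptoss-wheels vllm/vllm-openai:gptoss     # create without starting, just to copy files out
docker cp gptoss-wheels:/vllm-workspace ./container-wheels     # grab the bundled wheels
docker rm gptoss-wheels
pip install ./container-wheels/*.whl --extra-index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall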
Hope my investigation helps until upstream provides a fixed version.
In my case, upgrading CUDA to 12.8 on the H100 made it work. I also used --force-reinstall.
I understand that upgrading to FlashAttention-3 should resolve the issue.
However, I couldn’t find any pre-built wheels for FlashAttention-3, and compiling from source is extremely slow (and often gets killed due to high memory usage). This makes it very difficult to test or deploy on my current setup.
If anyone has a pre-built wheel for PyTorch 2.9 + CUDA 12.8/12.9, I’d really appreciate it!
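For anyone who does attempt the source build, capping build parallelism usually avoids the out-of-memory kills. A sketch, assuming the upstream flash-attention repo layout (FlashAttention-3 builds from the hopper/ subdirectory) and treating MAX_JOBS, the knob documented for FlashAttention-2 builds, as applicable here too:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
MAX_JOBS=4 python setup.py install   # lower MAX_JOBS further if the build still gets killed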
Just:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache
We ended up customizing vLLM and plugging in the NVIDIA toolkit because the model uses nvcc directly.
Here's a working inference example: http://playground.tracto.ai/playground?pr=notebooks/bulk-inference-gpt-oss-120b. Feel free to run with it.
If you are on Hopper or Blackwell, please make sure you are on 0.10.1+gptoss, i.e.
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache
Just:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache
This worked for me, but I had to do one additional step...
I completely wiped my venv
rm -rf path/to/gptoss-venv
and re-created it fresh
uv venv path/to/gptoss-venv
Followed by
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match --force-reinstall --no-cache
And I'm finally up and running.
Please reinstall. There was a 2-hour window yesterday during which the wheels had the wrong FA interface copied in. Sorry about the trouble. 🙇
Encountered the same issue when specifying "flash_attention_2" for the attn_implementation argument. The issue was resolved by either omitting the argument or explicitly setting it to "sdpa".
It seems that "sdpa" doesn't support gpt-oss. I got this on an RTX 4090:
raise ValueError(
ValueError: GptOssForCausalLM does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument attn_implementation="eager"
meanwhile. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")