[v1 engine][flash_attn backend] TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux' when running gpt-oss-120b on H200

#41 opened by RekklesAI

vLLM v1 engine crashes at runtime when using the FlashAttention backend. The kernel call passes s_aux, but the installed flash-attn doesn’t accept this keyword, causing:

TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux'

Environment

OS: (e.g., Ubuntu 22.04)

Python: 3.12

vLLM: v0.10.2.dev2+gf5635d62e.d20250805 (from log)

PyTorch: 2.9.0.dev20250804+cu128

CUDA: 12.8

GPUs: 2× H200

Install method:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match

Command to Reproduce:
export VLLM_ATTENTION_BACKEND=flash-attn  # default or implied
vllm serve openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --tensor-parallel-size 2

Send any chat/completions request; the crash happens on the first step.
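For example, the crash can be triggered with a minimal client call (a sketch; it assumes the server above is listening on the default port 8000 and the openai Python package is installed):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Any chat/completions request crashes the engine on its first forward pass.
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)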

Key Logs: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 's_aux'
at vllm/v1/attention/backends/flash_attn.py:526
...
torch._ops.vllm.unified_attention_with_output(...)
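A quick way to confirm the interface mismatch is to inspect the signature of the kernel that vLLM calls (a sketch; vLLM normally imports its own vendored copy under vllm.vllm_flash_attn rather than the standalone flash_attn package, so check whichever one is importable in the serving environment):

import inspect

try:
    # vLLM's vendored FlashAttention build, used by the flash-attn backend.
    from vllm.vllm_flash_attn import flash_attn_varlen_func
except ImportError:
    # Fall back to the standalone flash-attn package if it is installed.
    from flash_attn import flash_attn_varlen_func

# The crashing call passes s_aux; an older interface will not list it here.
print("accepts s_aux:", "s_aux" in inspect.signature(flash_attn_varlen_func).parameters)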

What I tried

Switching to the torch-sdpa backend works as a workaround:
export VLLM_ATTENTION_BACKEND=torch-sdpa
Rebuilding flash-attn from source (the current PyTorch is nightly 2.9 + CUDA 12.8, so no matching wheel is available).
Uninstalling flash-attn/flashinfer to avoid version conflicts.

Unfortunately, aside from the torch-sdpa workaround, none of these steps fixed the flash-attn backend itself. I'd appreciate any guidance on resolving this.

I have the same issue. Why did that happen?

I have the same issue (with 2x H100 GPUs)

(EngineCore_0 pid=) ERROR 08-06 00:32:38 [core.py:720] RuntimeError: Worker failed with error 'flash_attn_varlen_func() got an unexpected keyword argument 's_aux'', please check the stack trace above for the root cause
...
(APIServer pid=) ERROR 08-06 00:32:38 [serving_chat.py:1001] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Encountered the same issue when specifying "flash_attention_2" for the attn_implementation argument. The issue was resolved by either omitting the argument or explicitly setting it to "sdpa".

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="sdpa",  # "flash_attention_2" reproduces the s_aux error
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1] :]))

Investigation

I found a difference in the vllm and openai-harmony wheels between the Docker container (which works well locally) and the upstream index.

Here are the md5sums of the wheels provided at https://wheels.vllm.ai/gpt-oss/, the extra index used in the gpt-oss guide:

a9c90e016ef0f04e963944f48c13da86  flashinfer_python-0.2.8-py3-none-any.whl
2be184d07d76ea6a1ad41e069de13d8f  gpt_oss-0.1.0-py3-none-any.whl
5f2093088f8a5c46fa010000dbde807b  triton-3.4.0+git663e04e8-cp38-abi3-linux_x86_64.whl
5efee9b5fa4409318695c7e9dc155282  triton_kernels-1.0.0-py3-none-any.whl
d790a1cbde026100db2d5626667454ce  vllm-0.10.1+gptoss-cp38-abi3-linux_x86_64.whl

However, the wheels inside the working container are as follows:

a9c90e016ef0f04e963944f48c13da86  flashinfer_python-0.2.8-py3-none-any.whl
2be184d07d76ea6a1ad41e069de13d8f  gpt_oss-0.1.0-py3-none-any.whl
27d4a3918f6af4550bdb5f66bf68ae39  openai_harmony-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
5f2093088f8a5c46fa010000dbde807b  triton-3.4.0+git663e04e8-cp38-abi3-linux_x86_64.whl
5efee9b5fa4409318695c7e9dc155282  triton_kernels-1.0.0-py3-none-any.whl
47d8963ec500eb6793191f1d3b3e66b4  vllm-0.10.1+gptoss-cp38-abi3-linux_x86_64.whl

Apparently the vllm wheel in the container has a different md5sum despite the identical file name, and the container also ships a newer openai_harmony version than the upstream PyPI index.
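For anyone who wants to repeat the comparison, here is a small sketch (the ./wheels path is hypothetical; point it at wherever you copied the wheels):

import hashlib
from pathlib import Path

# Compute the md5 of each wheel and compare against the lists above.
wheel_dir = Path("./wheels")  # e.g. the directory you copied the wheels into
for whl in sorted(wheel_dir.glob("*.whl")):
    digest = hashlib.md5(whl.read_bytes()).hexdigest()
    print(f"{digest}  {whl.name}")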

Workaround

  1. Download the Docker container image from Docker Hub.
  2. Copy all the wheels under /vllm-workspace into a local directory (e.g. with docker cp).
  3. Install those wheels instead of the upstream ones (e.g. pip install *.whl --extra-index-url https://download.pytorch.org/whl/nightly/cu128).

You may need the --force-reinstall flag if you have already installed from the upstream index, because the vllm package name is the same as upstream's.

Hope my investigation helps until upstream provides a fixed version.

In my case, upgrading CUDA to 12.8 on the H100 made it work.
I also used --force-reinstall.

I understand that upgrading to FlashAttention-3 should resolve the issue.
However, I couldn’t find any pre-built wheels for FlashAttention-3, and compiling from source is extremely slow (and often gets killed due to high memory usage). This makes it very difficult to test or deploy on my current setup.

If anyone has a pre-built wheel for PyTorch 2.9 + CUDA 12.8/12.9, I’d really appreciate it!

Just:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache

We ended up customizing vLLM and plugging in the NVIDIA toolkit because the model uses nvcc directly.
Here's a working inference example: http://playground.tracto.ai/playground?pr=notebooks/bulk-inference-gpt-oss-120b . Feel free to run with it.

If you are on Hopper or Blackwell, please make sure you are on 0.10.1+gptoss,

i.e.

uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache

Just:
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match --force-reinstall --no-cache

This worked for me, but I had to do one additional step.

I completely wiped my venv

rm -rf path/to/gptoss-venv

and re-created it fresh

uv venv path/to/gptoss-venv

Followed by

uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match --force-reinstall --no-cache

And I'm finally up and running.

Please reinstall. There was a 2-hour window yesterday during which the wheels had the wrong FA interface copied in. Sorry about the trouble. 🙇

Encountered the same issue when specifying "flash_attention_2" for the attn_implementation argument. The issue was resolved by either omitting the argument or explicitly setting it to "sdpa".


It seems that "sdpa" doesn't support gpt-oss. I got this on an RTX 4090:
raise ValueError(
ValueError: GptOssForCausalLM does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument attn_implementation="eager" meanwhile. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")
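As that error message itself suggests, one fallback is to load the model with eager attention instead (a sketch; same snippet as above, with only attn_implementation changed):

from transformers import AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"

# "eager" is the fallback the ValueError recommends when sdpa is not supported
# for this architecture; "flash_attention_2" is what triggers the s_aux error.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="eager",
)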
