gpt-oss-20b on NVIDIA A10G: deployment error with Attention Sinks

#120
by Brayanb - opened

This is the main issue: the gpt-oss-20b model uses a feature called "Attention Sinks", which vLLM can only run with FlashAttention 3. The problem is that FlashAttention 3 requires newer GPU hardware, found only on Hopper/Blackwell GPUs, not on my NVIDIA A10G (Ampere).
Help, please: the deployment is on Modal. Any ideas?
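
For context, here is a minimal sketch of the workaround I'm considering: requesting a Hopper GPU from Modal instead of the A10G, so FlashAttention 3 kernels are available. I'm assuming Modal's `gpu` parameter accepts `"H100"` and that the `vllm serve` CLI works for `openai/gpt-oss-20b`; the image setup below is illustrative, not tested.

```python
# Minimal sketch, assuming a vLLM build with gpt-oss / attention-sink
# support and that Modal can allocate an H100 (Hopper) for this function.
import modal

app = modal.App("gpt-oss-20b-vllm")

image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm")  # a recent version with gpt-oss support
)

@app.function(gpu="H100", image=image, timeout=600)
def serve():
    import subprocess

    # FlashAttention 3 targets Hopper (SM 9.0), which is what vLLM
    # needs here for gpt-oss-20b's attention sinks; an A10G (Ampere,
    # SM 8.6) cannot run those kernels.
    subprocess.run(
        ["vllm", "serve", "openai/gpt-oss-20b"],
        check=True,
    )
```

If staying on Ampere is a hard requirement, maybe a non-FlashAttention-3 backend in vLLM (selected via the `VLLM_ATTENTION_BACKEND` environment variable) can handle the attention sinks, but I haven't confirmed that.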

(Screenshot attached: Captura de pantalla 2025-08-13 072051.png)