Request: 4-bit GPTQ or AWQ quantized version of openai/gpt-oss-20b

#32
by powtac

Is there any plan or ongoing effort to release a 4-bit quantized version (e.g. GPTQ or AWQ) of openai/gpt-oss-20b?
A quantized format would make it feasible to run this model on consumer GPUs (e.g. 4–8 GB VRAM) and compact CPU inference setups via GGUF/llama.cpp.
Would appreciate any pointers or upcoming availability.

It is already quantized to 4-bit; it was trained at 4-bit (native MXFP4).

You can actually run gpt-oss-20b locally using Ollama with their MXFP4 quantized version — no manual setup required.

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

This works well on systems with ≥16 GB VRAM or unified memory.

Official guide from OpenAI: https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama
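
Once the model is pulled, Ollama also exposes a local HTTP API (port 11434 by default), so you can script against it. A minimal sketch, assuming a default local install:

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Explain MXFP4 quantization in one sentence.",
  "stream": false
}'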

Looking at this again, I would like an FP8 version, because Ada and 40-series GPUs do not currently work with MXFP4.

But why can't vLLM run this on a 2080 Ti? It says MXFP4 is not supported on sm75.
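
For reference, you can check which compute capability a card reports with a quick PyTorch one-liner; a 2080 Ti (Turing) should print (7, 5), i.e. the sm75 that the error message refers to:

python -c "import torch; print(torch.cuda.get_device_capability(0))"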

There is an optimized and quantized ONNX model for gpt-oss-20b, and it is available through Foundry Local and AI Toolkit for VS Code. Please see the official OpenAI announcement for more details. I have also uploaded the model to Hugging Face here.

Is there any plan or ongoing effort to release a 4-bit quantized version (e.g. GPTQ or AWQ) of openai/gpt-oss-20b?

gpt-oss was trained with native MXFP4 quantization, so you can load the GGUF versions at full precision.
llama-server -hf ggml-org/gpt-oss-20b-GGUF
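
Once the server is up, you can hit its OpenAI-compatible endpoint. A minimal sketch, assuming the default localhost:8080:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}], "max_tokens": 64}'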

The issue is running either vLLM or SGLang on non-Hopper or non-Blackwell architectures.

With transformers main, it should even work on a T4! Please try the following Google Colab: https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
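
If you would rather test locally than in Colab, here is a minimal transformers sketch along the same lines (assuming transformers main plus accelerate, and enough GPU memory; the prompt and generation settings are just placeholders):

python - <<'PY'
from transformers import pipeline

# Loads the native MXFP4 checkpoint; device_map="auto" places it on the available GPU.
pipe = pipeline("text-generation", model="openai/gpt-oss-20b", torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1])  # last message is the assistant reply
PY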

Looking at this again, I would like an FP8 version, because Ada and 40-series GPUs do not currently work with MXFP4.

Same here. I would like to run it on my 4× A6000 Ada with vLLM, but it's not working for now since vLLM doesn't support MXFP4 on that architecture. There is no AWQ/GPTQ version either, and with bitsandbytes tensor parallelism won't work.
Might try GGUF, but I think it's still experimental.

Update:

I got it working by following the procedure intended for Ampere. These are the steps, summarized by an LLM:

# ======================================================================
# gpt-oss on vLLM (Ada Lovelace, 4× RTX 6000/A6000) — CUDA 12.8, MXFP4
# One-file setup & run guide with comments
# Tested with: Ubuntu 24.04, Python 3.12, Driver 580.xx (CUDA 13 driver)
# ======================================================================

# -------------------------------
# [0] (Optional) Install CUDA *toolkit* 12.8 side-by-side (NO driver change)
#     Safe to coexist with your existing CUDA 13.x driver.
# -------------------------------
# Skip if /usr/local/cuda-12.8 already exists.
sudo apt-get update && sudo apt-get install -y wget gnupg
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-8

# Point the current shell at CUDA 12.8 (only affects this session):
export CUDA_HOME=/usr/local/cuda-12.8
export CUDACXX=$CUDA_HOME/bin/nvcc
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}
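
# Quick sanity check that this shell now picks up the 12.8 toolchain (optional):
nvcc --version    # should report "release 12.8"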

# -------------------------------
# [1] Create a clean Python venv and base pins
# -------------------------------
python3 -m venv ~/vllm-venv
source ~/vllm-venv/bin/activate

python -m pip install --upgrade pip
# uv = fast pip, used by the official recipe; numpy<2.3 avoids Numba breakage.
pip install uv
pip install "setuptools>=77.0.3,<80" "numpy<2.3"

# -------------------------------
# [2] Install vLLM (GPT-OSS wheel) with cu128 PyTorch *nightlies*
#     This is the exact pattern used in the official tutorial.
# -------------------------------
uv pip install --pre "vllm==0.10.1+gptoss" \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match

# -------------------------------
# [3] FlashInfer v2 (prebuilt) for fast sampling on Ada
#     Using the prebuilt wheel avoids fragile JIT builds.
# -------------------------------
uv pip install "flashinfer-python==0.2.10"

# -------------------------------
# [4] Skipped: personalization stuff

# -------------------------------
# [5] Ada-safe attention backend (avoid FA3+sinks) + optional arch hint
#     Triton unified attention is the stable path today for Ada.
# -------------------------------
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
export TORCH_CUDA_ARCH_LIST="8.9"     # Ada (RTX 6000/A6000 Ada / L40S)

# -------------------------------
# [6] Quick environment sanity check (optional)
# -------------------------------
python - <<'PY'
import torch, vllm, numpy, importlib.metadata, importlib.util
print("Torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| cuda?", torch.cuda.is_available())
print("vLLM:", vllm.__version__)
try: print("Triton:", importlib.metadata.version("triton"))
except Exception as e: print("Triton: (not found)", e)
print("FlashInfer present:", importlib.util.find_spec("flashinfer") is not None)
print("NumPy:", numpy.__version__)
print("GPUs:", torch.cuda.device_count(), [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
PY

# -------------------------------
# [7] 20B smoke-serve (uses local cache; TP=4 across 4 GPUs)
#     --async-scheduling recommended; vLLM will disable custom allreduce automatically for >2 PCIe GPUs.
# -------------------------------
vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 4 \
  --async-scheduling \
  --gpu-memory-utilization 0.90 &
VLLM_PID=$!
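
# -------------------------------
# [8] (Optional) Smoke-test the OpenAI-compatible endpoint
#     Minimal sketch; assumes the default bind address localhost:8000.
#     Wait until the server logs show the model is loaded before querying.
# -------------------------------
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'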

An AWQ version would be awesome 💪🚀
