Running gpt-oss-20b on an RTX 4070 Ti (12GB) using Transformers
Hi everyone,
I'd like to share the method I used to run the gpt-oss-20b model on a single RTX 4070 Ti (12GB VRAM) with the transformers library.
First, as the guide notes, the MXFP4-quantized model cannot be used on 40-series cards.
So you need to recover the original (unquantized) weights by loading the model with the de-quantize option.
If you use device_map='auto' at this step, a KeyError occurs because the model gets split across multiple devices, so load it onto the CPU only and then save it locally.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils.quantization_config import Mxfp4Config

model_id = "openai/gpt-oss-20b"
save_path = './gpt-oss-model-local'

try:
    # dequantize=True converts the MXFP4 weights back to full precision on load
    quantization_config = Mxfp4Config(dequantize=True)

    # load on the CPU only; device_map="auto" raises a KeyError here
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        device_map="cpu"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # save the de-quantized bf16 model locally for the next step
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
except Exception as e:
    print(e)
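As a quick optional sanity check (my addition, not part of the original workflow), you can confirm the de-quantized checkpoint was actually written and see how much disk space the bf16 weights take, since they are noticeably larger than the MXFP4 download:

from pathlib import Path

# count the saved safetensors shards and their total size on disk
shards = sorted(Path('./gpt-oss-model-local').glob("*.safetensors"))
total_gb = sum(p.stat().st_size for p in shards) / 1024**3
print(f"{len(shards)} weight shard(s), {total_gb:.1f} GiB on disk")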
Next, load the de-quantized model you saved locally and quantize it to 4-bit with bitsandbytes (bnb).
If you use device_map='auto' here, you run into a VRAM OOM, so you have to map the layers manually.
On the 4070 Ti I could fit at most 15 layers on the GPU; setting it any higher caused an OOM.
If you have a bigger 40-series GPU, you can raise this number (see the VRAM-check sketch after the script below), and I'd be grateful if you could share your results.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time
model_path = './gpt-oss-model-local'
# NF4 4-bit quantization; CPU offload must be allowed because part of the model stays on the CPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True
)
# gpt-oss-20b has 24 transformer layers: the first 15 go to the GPU (device 0),
# the rest plus the final norm and lm_head stay on the CPU
num_gpu_layers = 15
num_total_layers = 24
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(num_gpu_layers)},
    **{f"model.layers.{i}": "cpu" for i in range(num_gpu_layers, num_total_layers)},
    "model.norm": "cpu",
    "lm_head": "cpu"
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    # attn_implementation="flash_attention_3",
    # attn_implementation="sdpa",
    device_map=device_map
)
print(f"current attention impl: {model.config._attn_implementation}")
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is in simple terms."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
# generate a single token to measure per-token latency
max_new_tokens = 1

start_time = time.perf_counter()
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    temperature=0.7  # note: temperature only takes effect when do_sample=True
)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

print("inf end")
print(tokenizer.decode(outputs[0]))
print("\n" + "="*30)
print(f"elapsed time: {elapsed_time:.2f}sec")
print("="*30)
Additionally, flash attention doesn't seem to work here, so the model falls back to the default eager attention. If anyone has it working, I'd appreciate it if you could share how.
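For what it's worth, here is a small sketch (my addition; the helper name is mine) of how one could probe which attention backends transformers accepts on a given setup, falling back to the default when a backend is unavailable. You would call it with the same quantization_config and device_map as above:

from transformers import AutoModelForCausalLM

def load_with_best_attention(model_path, **kwargs):
    # try the faster backends first; transformers raises ImportError/ValueError
    # when a backend is not installed or not supported for this model
    for impl in ("flash_attention_3", "flash_attention_2", "sdpa"):
        try:
            return AutoModelForCausalLM.from_pretrained(
                model_path, attn_implementation=impl, **kwargs
            )
        except (ImportError, ValueError) as err:
            print(f"{impl} unavailable: {err}")
    # last resort: let transformers pick its default implementation
    return AutoModelForCausalLM.from_pretrained(model_path, **kwargs)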
With the reasoning level set to medium, it took about 4 to 5 minutes to generate a single token.
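The reasoning level isn't set anywhere in the script above, so for anyone reproducing this, here is a hedged sketch of one way to set it. The reasoning_effort kwarg is my assumption about the current gpt-oss chat template (the model card alternatively describes writing "Reasoning: medium" into the system prompt), so print the rendered prompt and confirm it shows up:

# sketch (my addition): render the prompt without tokenizing and check the
# system block; extra kwargs are passed through to the chat template and are
# ignored by templates that don't use them
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is in simple terms."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    reasoning_effort="medium",
)
print(prompt_text)  # verify the rendered system block shows "Reasoning: medium"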
As a result, it seems difficult to use gpt-oss-20b at 4-bit quantization on a GPU like the RTX 4070 Ti in practice.
If there's anything I've missed, or if anyone has had success running this on an RTX 40-series GPU, please let me know.
Thanks