If you have issues running gpt-oss-20b in Google's Colab notebooks, this might be useful to know
At first, I couldn't run the gpt-oss-20b model on Google's Colab notebooks because it kept showing errors at different steps of the process.
It turns out the versions of the model's dependencies that Colab provides by default are different from what's expected, and this created all sorts of conflicts.
To solve all of these issues and run the model, just make sure to install said dependencies with this command:
!pip install -U "transformers>=4.55.0" kernels torch==2.6.0
Also, make sure the runtime uses a GPU, as the model can't run on TPUs.
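For reference, here's the kind of cell I run afterwards (a minimal sketch; the prompt, max_new_tokens, and the torch_dtype/device_map arguments are just my own choices):

from transformers import pipeline

# assumes the installs above and a GPU runtime
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the checkpoint's dtype where possible
    device_map="auto",    # place the weights on the Colab GPU
)

messages = [{"role": "user", "content": "Who are you?"}]
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply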
I keep encountering the following error. Has anyone else run into this?
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1264             config_class = get_class_from_dynamic_module(
-> 1265                 class_ref, pretrained_model_name_or_path, code_revision=code_revision, **kwargs
   1266             )

3 frames

KeyError: 'gpt_oss'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1265                 class_ref, pretrained_model_name_or_path, code_revision=code_revision, **kwargs
   1266             )
-> 1267             config_class.register_for_auto_class()
   1268             return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
   1269         elif "model_type" in config_dict:

ValueError: The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git
Thank you, that is super useful!
I keep getting this warning: "MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16". But it seems triton==3.4.0 is not compatible with torch==2.6.0. How did you solve this?
Is there any way to solve this problem?
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu and disk.
Falling back to torch.float32 because loading with the original dtype failed on the target device.
Device set to use cpu
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipython-input-834885792.py in <cell line: 0>()
      3 ]
      4
--> 5 outputs = pipe(
      6     messages,
      7     max_new_tokens=256,

/usr/local/lib/python3.11/dist-packages/accelerate/utils/offload.py in __getitem__(self, key)
    163         if key in self.state_dict:
    164             return self.state_dict[key]
-> 165         weight_info = self.index[key]
    166         if weight_info.get("safetensors_file") is not None:
    167             device = "cpu" if self.device is None else self.device

KeyError: 'model.layers.0.mlp.experts.gate_up_proj'
I also have an issue: I can't run the 20B gpt-oss version in an A100 environment.
OutOfMemoryError                          Traceback (most recent call last)
/tmp/ipython-input-1055119348.py in <cell line: 0>()
      2 from transformers import pipeline
      3
----> 4 pipe = pipeline("text-generation", model="openai/gpt-oss-20b")
      5 messages = [
      6     {"role": "user", "content": "Who are you?"},

7 frames

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in convert(t)
   1327                     memory_format=convert_to_format,
   1328                 )
-> 1329             return t.to(
   1330                 device,
   1331                 dtype if t.is_floating_point() or t.is_complex() else None,

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB. GPU 0 has a total capacity of 39.56 GiB of which 1.04 GiB is free. Process 25354 has 38.51 GiB memory in use. Of the allocated memory 37.88 GiB is allocated by PyTorch, and 154.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
"MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16" <-- this warning makes Hugging Face load the full dequantized model, which is too large. How do we resolve this? It says it needs triton 3.4.0, but after updating triton it's still not working.
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide:
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for the triton kernels,
and use the dev version of transformers (v4.56.dev):
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
I cannot find a way to run this in less than 48 GB on Windows. If you run it without Triton, it dequantizes and, instead of requiring ~16 GB, it needs ~48 GB. Triton seems to be Linux-only.
Please read this: I was able to run the model in Colab using Ollama. https://medium.com/ai-simplified-in-plain-english/openai-gpt-oss-20b-the-power-of-collaboration-exploring-multi-agent-ai-systems-99ebc0df5a68
I got it to work on 4090
To resolve this error: MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
run: pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Then I ran into this error: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)
run: pip install git+https://github.com/huggingface/transformers.git
Then it could run without error using the transformers pipeline, using < 16 GB of VRAM.
This works for me, the VRAM used is 14.4GB
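A rough way to confirm which path you ended up on, once the pipeline has loaded, is to read torch's allocator stats (the ~15 GB vs ~40 GB figures are just what's reported in this thread):

import torch

# roughly 14-15 GiB allocated means the checkpoint stayed MXFP4-quantized;
# around 40 GiB (or an OOM) means it was dequantized to bf16
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB allocated on GPU 0")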
I am running into memory errors when I run the model. Has anyone managed to bypass the error?
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for triton kernels
and use dev version of transformers v4.56.dev
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
Sorry, it doesn't work for me on RTX 4090.
Before I do that, it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16", even though my triton is >= 3.4.0.
After I run "pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels", it reports "ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)".
After I run "pip install git+https://github.com/huggingface/transformers.git" and get "Successfully installed transformers-4.56.0.dev0", it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16" again.
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for triton kernels
and use dev version of transformers v4.56.dev
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
sorry, it doesn't work for me on RTX 4090.
before i do that, it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16", even through my triton >= 3.4.0
after i do "pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels", it reports "ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)"
after i do "pip install git+ https://github.com/huggingface/transformers.git" and get "Successfully installed transformers-4.56.0.dev0", it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16" again
I solved it!!! You need to run "pip install kernels".
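For anyone else hitting this, a quick sanity cell to run before loading the model (the import names below are just how the packages mentioned in this thread expose themselves, so treat it as a sketch):

import triton
import kernels          # from "pip install kernels"
import triton_kernels   # from the triton repo subdirectory install above

print("triton", triton.__version__)  # the warning asks for >= 3.4.0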
Same error for me... ValueError: The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
I'm using Google Colab and I've already tried updating transformers; I've already run !pip install -U "transformers>=4.55.0" kernels torch==2.6.0 as mentioned before, and I still have the same error. Help, please.
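One thing I still want to rule out: whether the runtime actually picked up the new version after the install, since Colab may keep the old transformers loaded until the runtime is restarted. This is the check I'm using (just printing the version the kernel sees):

import transformers
print(transformers.__version__)  # needs to be >= 4.55.0 for the gpt_oss architecture
# if this still shows an older version, restart the Colab runtime and re-run the install and imports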
Trying to run gpt-oss-20b on a T4 in Colab, I'm facing the memory issue below. Was anyone able to resolve this?
Loading checkpoint shards:   0% | 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
/tmp/ipython-input-2717120482.py in <cell line: 0>()
      4
      5 tokenizer = AutoTokenizer.from_pretrained(model_id)
----> 6 model = AutoModelForCausalLM.from_pretrained(
      7     model_id,
      8     torch_dtype="auto",

9 frames

/usr/local/lib/python3.11/dist-packages/transformers/integrations/mxfp4.py in convert_moe_packed_tensors(blocks, scales, dtype, rows_per_chunk)
    121
    122     # nibble indices -> int64
--> 123     idx_lo = (blk & 0x0F).to(torch.long)
    124     idx_hi = (blk >> 4).to(torch.long)
    125

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.98 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.47 GiB is free. Process 7925 has 13.27 GiB memory in use. Of the allocated memory 11.95 GiB is allocated by PyTorch, and 1.21 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I have tried the things below to free up memory, still no luck:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import gc
import torch

torch.cuda.empty_cache()
gc.collect()
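One thing worth noting: torch.cuda.empty_cache() only releases memory that nothing references any more, so references from the failed attempt would need to be dropped first (a rough sketch below, with hypothetical variable names). And even then, as noted earlier in the thread, the bf16 fallback needs roughly 40-48 GB, which a 16 GB T4 can't hold.

import gc
import torch

# drop references left over from the failed attempt, otherwise the cached memory stays reachable
for name in ("model", "pipe"):      # hypothetical names from the failing cell, use your own
    if name in globals():
        del globals()[name]

gc.collect()
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB still allocated")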