If you have issues running gpt-oss-20b in Google's Colab notebooks, this might be useful to know
At first, I couldn't run the gpt-oss-20b model on Google's Colab notebooks because it kept showing errors at different steps of the process.
It turns out the versions of the model's dependencies that Colab provides by default are different from what's expected, and this created all sorts of conflicts.
To solve all of these issues and run the model, just make sure to install said dependencies with this command:
!pip install -U "transformers>=4.55.0" kernels torch==2.6.0
Also, make sure the runtime uses a GPU, as the model can't run on TPUs.
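For reference, here's the kind of cell I run afterwards (a minimal sketch; the prompt, max_new_tokens, and the torch_dtype/device_map arguments are just my own choices):

from transformers import pipeline

# assumes the installs above and a GPU runtime
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the checkpoint's dtype where possible
    device_map="auto",    # place the weights on the Colab GPU
)

messages = [{"role": "user", "content": "Who are you?"}]
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])  # the assistant's reply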
I keep encountering the following error. Has anyone else run into this?
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1264             config_class = get_class_from_dynamic_module(
-> 1265                 class_ref, pretrained_model_name_or_path, code_revision=code_revision, **kwargs
   1266             )

3 frames

KeyError: 'gpt_oss'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
   1265                 class_ref, pretrained_model_name_or_path, code_revision=code_revision, **kwargs
   1266             )
-> 1267             config_class.register_for_auto_class()
   1268             return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
   1269         elif "model_type" in config_dict:

ValueError: The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git
Thank you, that is super useful!
I keep getting this warning: "MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16". But it seems triton==3.4.0 is not compatible with torch==2.6.0. How did you solve this?
Is there any way to solve this problem?
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu and disk.
Falling back to torch.float32 because loading with the original dtype failed on the target device.
Device set to use cpu
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipython-input-834885792.py in <cell line: 0>()
      3 ]
      4
--> 5 outputs = pipe(
      6     messages,
      7     max_new_tokens=256,

/usr/local/lib/python3.11/dist-packages/accelerate/utils/offload.py in __getitem__(self, key)
    163         if key in self.state_dict:
    164             return self.state_dict[key]
-> 165         weight_info = self.index[key]
    166         if weight_info.get("safetensors_file") is not None:
    167             device = "cpu" if self.device is None else self.device

KeyError: 'model.layers.0.mlp.experts.gate_up_proj'
I also have an issue: I can't run the 20B gpt-oss version in an A100 environment.
OutOfMemoryError                          Traceback (most recent call last)
/tmp/ipython-input-1055119348.py in <cell line: 0>()
      2 from transformers import pipeline
      3
----> 4 pipe = pipeline("text-generation", model="openai/gpt-oss-20b")
      5 messages = [
      6     {"role": "user", "content": "Who are you?"},

7 frames

/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in convert(t)
   1327                     memory_format=convert_to_format,
   1328                 )
-> 1329             return t.to(
   1330                 device,
   1331                 dtype if t.is_floating_point() or t.is_complex() else None,

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB. GPU 0 has a total capacity of 39.56 GiB of which 1.04 GiB is free. Process 25354 has 38.51 GiB memory in use. Of the allocated memory 37.88 GiB is allocated by PyTorch, and 154.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
"MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16" <-- this warning makes Hugging Face load the full dequantized model, which is too large. How do we resolve this? It says it needs triton 3.4.0, but after updating triton it's still not working.
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide:
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for the triton kernels,
and use the dev version of transformers (v4.56.dev):
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
I cannot find a way to run this in less than 48 GB on Windows. If you run it without Triton, it dequantizes and, instead of requiring ~16 GB, it needs ~48 GB. Triton seems to be Linux-only.
Please read this: I was able to run the model in Colab using Ollama. https://medium.com/ai-simplified-in-plain-english/openai-gpt-oss-20b-the-power-of-collaboration-exploring-multi-agent-ai-systems-99ebc0df5a68
I got it to work on 4090
To resolve this error: MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
run: pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Then I ran into this error: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)
run: pip install git+https://github.com/huggingface/transformers.git
Then it could run without error using the transformers pipeline, using < 16 GB of VRAM.
This works for me, the VRAM used is 14.4GB
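A rough way to confirm which path you ended up on, once the pipeline has loaded, is to read torch's allocator stats (the ~15 GB vs ~40 GB figures are just what's reported in this thread):

import torch

# roughly 14-15 GiB allocated means the checkpoint stayed MXFP4-quantized;
# around 40 GiB (or an OOM) means it was dequantized to bf16
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB allocated on GPU 0")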
I am running into memory errors when I run the model. Has anyone managed to bypass the error?
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for triton kernels
and use dev version of transformers v4.56.dev
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
Sorry, it doesn't work for me on RTX 4090.
Before I do that, it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16", even though my triton is >= 3.4.0.
After I run "pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels", it reports "ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)".
After I run "pip install git+https://github.com/huggingface/transformers.git" and get "Successfully installed transformers-4.56.0.dev0", it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16" again.
"MXFP4 is supported in Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards." :(
Update:
They updated the guide
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
for triton kernels
and use dev version of transformers v4.56.dev
pip install git+https://github.com/huggingface/transformers.git
Did it on RTX 4090!
sorry, it doesn't work for me on RTX 4090.
before i do that, it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16", even through my triton >= 3.4.0
after i do "pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels", it reports "ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)"
after i do "pip install git+ https://github.com/huggingface/transformers.git" and get "Successfully installed transformers-4.56.0.dev0", it reports "MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16" again
I solved it!!! You need to run "pip install kernels".
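For anyone else hitting this, a quick sanity cell to run before loading the model (the import names below are just how the packages mentioned in this thread expose themselves, so treat it as a sketch):

import triton
import kernels          # from "pip install kernels"
import triton_kernels   # from the triton repo subdirectory install above

print("triton", triton.__version__)  # the warning asks for >= 3.4.0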
Same error for me... ValueError: The checkpoint you are trying to load has model type gpt_oss but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
I'm using Google Colab and I've already tried updating transformers; I've already run !pip install -U "transformers>=4.55.0" kernels torch==2.6.0 as mentioned before, and I still have the same error. Help, please.
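One thing I still want to rule out: whether the runtime actually picked up the new version after the install, since Colab may keep the old transformers loaded until the runtime is restarted. This is the check I'm using (just printing the version the kernel sees):

import transformers
print(transformers.__version__)  # needs to be >= 4.55.0 for the gpt_oss architecture
# if this still shows an older version, restart the Colab runtime and re-run the install and imports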
Trying to run gpt-oss-20b on a T4 in Colab, I'm facing the memory issue below. Was anyone able to resolve this?
Loading checkpoint shards:   0% | 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
/tmp/ipython-input-2717120482.py in <cell line: 0>()
      4
      5 tokenizer = AutoTokenizer.from_pretrained(model_id)
----> 6 model = AutoModelForCausalLM.from_pretrained(
      7     model_id,
      8     torch_dtype="auto",

9 frames

/usr/local/lib/python3.11/dist-packages/transformers/integrations/mxfp4.py in convert_moe_packed_tensors(blocks, scales, dtype, rows_per_chunk)
    121
    122     # nibble indices -> int64
--> 123     idx_lo = (blk & 0x0F).to(torch.long)
    124     idx_hi = (blk >> 4).to(torch.long)
    125

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.98 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.47 GiB is free. Process 7925 has 13.27 GiB memory in use. Of the allocated memory 11.95 GiB is allocated by PyTorch, and 1.21 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I have tried the things below to free up memory, still no luck:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import gc
import torch

torch.cuda.empty_cache()
gc.collect()
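One thing worth noting: torch.cuda.empty_cache() only releases memory that nothing references any more, so references from the failed attempt would need to be dropped first (a rough sketch below, with hypothetical variable names). And even then, as noted earlier in the thread, the bf16 fallback needs roughly 40-48 GB, which a 16 GB T4 can't hold.

import gc
import torch

# drop references left over from the failed attempt, otherwise the cached memory stays reachable
for name in ("model", "pipe"):      # hypothetical names from the failing cell, use your own
    if name in globals():
        del globals()[name]

gc.collect()
torch.cuda.empty_cache()
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB still allocated")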