---
language:
- en
license: apache-2.0
tags:
- gpt-oss
- openai
- mxfp4
- mixture-of-experts
- causal-lm
- text-generation
- cpu-gpu-offload
- colab
datasets:
- openai/gpt-oss-training-data
pipeline_tag: text-generation
---
# gpt-oss-20b-offload
This is a CPU+GPU offload‑ready copy of GPT‑OSS‑20B, an open‑weight Mixture‑of‑Experts large language model released by OpenAI in 2025.
It retains OpenAI’s original MXFP4 quantization and is configured for memory‑efficient loading in Colab or similarly GPU‑constrained environments.
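The MXFP4 quantization travels with the checkpoint configuration, so you can confirm it before downloading the full weights. A minimal sketch (the exact contents of the quantization entry depend on your transformers version):

```python
from transformers import AutoConfig

# Inspect the checkpoint configuration without loading any weights.
config = AutoConfig.from_pretrained("your-username/gpt-oss-20b-offload")

# Quantization settings are stored alongside the model config; None means the
# attribute is named differently in your transformers version.
print(getattr(config, "quantization_config", None))
```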
## Model Details
### Model Description
- Developed by: OpenAI
- Shared by: saurabh-srivastava (Hugging Face user)
- Model type: Decoder‑only transformer (Mixture‑of‑Experts) for causal language modeling
- Active experts per token: 4 of 32 total experts (see the config check after this list)
- Language(s): English (with capability for multilingual text generation)
- License: Apache 2.0 (per OpenAI’s GPT‑OSS release)
- Finetuned from model: openai/gpt-oss-20b (no additional fine‑tuning performed)
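The expert counts listed above can be read straight from the model configuration. A minimal sketch, assuming the attribute names used by the gpt-oss configuration in recent transformers releases (`num_local_experts`, `num_experts_per_tok`); if they come back as `None`, print `config.to_dict()` and look for the equivalents:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai/gpt-oss-20b")

# Attribute names are an assumption; fall back to config.to_dict() if None.
total_experts = getattr(config, "num_local_experts", None)
active_experts = getattr(config, "num_experts_per_tok", None)
print(f"{active_experts} active experts per token out of {total_experts} total")
```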
### Model Sources
- Original model repository: https://huggingface.co/openai/gpt-oss-20b
- OpenAI announcement: https://openai.com/index/introducing-gpt-oss/
## Uses
### Direct Use
- Text generation, summarization, and question answering.
- Running inference in low‑VRAM environments using CPU+GPU offload.
### Downstream Use
- Fine‑tuning for domain‑specific assistants (see the LoRA sketch after this list).
- Integration into chatbots or generative applications.
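As an illustration of the fine‑tuning path, here is a parameter‑efficient sketch using PEFT/LoRA. The target module names are an assumption (check `model.named_modules()` for your version), and many workflows fine‑tune from the upstream bf16 weights rather than this MXFP4 copy:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumption: adapting this offload-ready copy directly; for serious training,
# loading the original openai/gpt-oss-20b weights may be preferable.
base = AutoModelForCausalLM.from_pretrained(
    "your-username/gpt-oss-20b-offload",
    torch_dtype="auto",
    device_map="auto",
)

# LoRA adapters on the attention projections; module names are an assumption.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```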
### Out‑of‑Scope Use
- Generating harmful, biased, or false information.
- Any high‑stakes decision‑making without human oversight.
## Bias, Risks, and Limitations
Like all large language models, GPT‑OSS‑20B can:
- Produce factually incorrect or outdated information.
- Reflect biases present in its training data.
- Generate harmful or unsafe content if prompted.
### Recommendations
- Always use with a moderation layer.
- Validate outputs for factual accuracy before use in production.
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/gpt-oss-20b-offload"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load with CPU+GPU offload: layers that fit within the 20 GiB GPU budget are
# placed on device 0, the remainder is kept in (up to 64 GiB of) CPU RAM.
max_mem = {0: "20GiB", "cpu": "64GiB"}
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    max_memory=max_mem,
)

# With device_map="auto" the first layers typically sit on GPU 0, so the prompt goes there.
inputs = tokenizer("Explain GPT‑OSS‑20B in one paragraph.", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
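GPT‑OSS models are trained for chat‑style prompting (OpenAI’s harmony response format), so in practice you will usually build prompts with the tokenizer’s chat template rather than raw strings. A minimal sketch, continuing from the `model` and `tokenizer` loaded above and assuming the chat template bundled with the upstream tokenizer:

```python
# Build a chat-formatted prompt from the template shipped with the tokenizer.
messages = [{"role": "user", "content": "Explain GPT-OSS-20B in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(0)

outputs = model.generate(input_ids, max_new_tokens=120)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```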