What’s MXFP4? The 4-Bit Secret Powering OpenAI’s GPT‑OSS Models on Modest Hardware
With native MXFP4 precision, GPT‑OSS‑120B runs on a single H100 GPU and GPT‑OSS‑20B fits within just 16 GB of memory.

What is MXFP4? The Definition and Its Origin Story
MXFP4 stands for Microscaling FP4, a next-generation 4-bit floating-point format defined in the Open Compute Project (OCP) Microscaling Formats (MX) specification, published in 2023. The effort, backed by tech giants including AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, set out to lower the hardware and compute barriers to cutting-edge AI.
- Format: Each value is stored in just 4 bits, following the E2M1 layout: 1 sign bit, 2 exponent bits, and 1 mantissa bit per parameter (all 16 representable code points are enumerated in the sketch after this list).
- Block Structure: Instead of scaling each value independently, MXFP4 divides model data into small blocks (typically 32 elements) and assigns each block a single shared 8-bit power-of-two scale factor, a “microscaling” approach.
- Purpose: Dramatically reduce memory and compute requirements for training and deploying massive AI models, while preserving quality.
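To make the layout concrete, here is a minimal Python sketch that decodes all 16 E2M1 code points. The bit layout and exponent bias of 1 follow the OCP Microscaling spec; the function name is purely illustrative.

```python
# Minimal sketch: decode every 4-bit E2M1 (FP4) code point into its float value.
# Bit layout: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to its floating-point value."""
    sign = -1.0 if (code >> 3) & 0x1 else 1.0
    exp = (code >> 1) & 0x3   # 2-bit exponent field
    man = code & 0x1          # 1-bit mantissa field
    if exp == 0:              # subnormal: no implicit leading 1
        return sign * 0.5 * man
    return sign * (2.0 ** (exp - 1)) * (1.0 + 0.5 * man)

if __name__ == "__main__":
    for code in range(16):
        print(f"{code:04b} -> {decode_e2m1(code):+.1f}")
```

The positive magnitudes come out to 0, 0.5, 1, 1.5, 2, 3, 4, and 6: just eight levels, which is why the shared per-block scale described above matters so much.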
The Magic Inside: How MXFP4 Works
What sets MXFP4 apart from previous quantization strategies is its clever combination of extreme compression with minimal loss of precision. Here’s how:
- Block Formation: Model tensors are divided into blocks of 32 consecutive elements.
- Common Scale: Each block uses a single 8‑bit shared scale, calculated to best fit all values in the block.
- E2M1 Encoding: Each value in the block is quantized to 4 bits in E2M1 format (sign, exponent, mantissa).
- Reconstruction: The actual floating-point value is decoded as X_i = S × P_i, where X_i is the reconstructed floating-point value, P_i is the 4-bit FP4 quantized value (in E2M1 format), and S is the block’s shared scale.
This structure lets MXFP4 efficiently represent the wide dynamic range found in modern AI models, even with only 4 bits per value, while keeping storage overhead low: the shared scale adds only 8 bits per 32 values, about 0.25 extra bits per parameter. It’s a radical departure from uniform, per-tensor quantization.
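Putting these steps together, the following NumPy sketch quantizes and reconstructs a single 32-element block. It uses a simplified power-of-two rule to pick the shared scale; the OCP spec’s exact E8M0 scale selection and saturation behavior differ in detail, so treat this as an illustration rather than a reference implementation.

```python
# A minimal sketch of MXFP4-style block quantization, assuming NumPy.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
BLOCK = 32

def quantize_block(x: np.ndarray):
    """Map a block of floats to (shared power-of-two scale S, E2M1 values P_i)."""
    max_abs = np.abs(x).max()
    # Choose a power-of-two scale so the largest element lands within +/-6,
    # the widest E2M1 magnitude.
    scale = 2.0 ** np.ceil(np.log2(max_abs / 6.0)) if max_abs > 0 else 1.0
    scaled = x / scale
    # Round each scaled value to the nearest representable E2M1 magnitude,
    # keeping its sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return scale, q

def dequantize_block(scale: float, q: np.ndarray) -> np.ndarray:
    """Reconstruct X_i = S * P_i, matching the formula above."""
    return scale * q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(scale=0.02, size=BLOCK)
    s, q = quantize_block(block)
    print("shared scale:", s)
    print("max abs reconstruction error:", np.abs(block - dequantize_block(s, q)).max())
```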
Training, Not Just Inference: Advanced Techniques
For years, 4-bit quantization was considered “good enough” only for inference, not training. MXFP4 changed that by introducing robust methods to preserve gradient integrity:
- Stochastic Rounding: Randomizes the rounding direction in proportion to how close a value sits to each neighboring grid point, so quantization is unbiased on average; this prevents systematic bias from creeping into training updates and preserves learning progress (see the sketch after this list).
- Random Hadamard Transform: Redistributes values within a block before quantization, minimizing the impact of “outlier” values and helping gradients survive the quantization process.
- Group-wise Quantization: The 32-value block size strikes a critical balance between dynamic range and quantization error.
Together, these techniques made it possible to train massive models directly in MXFP4, with no need to pre-train in high precision first.
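To see why stochastic rounding preserves information that nearest-rounding destroys, here is a small NumPy sketch (not OpenAI’s training code) that rounds already-scaled values onto the E2M1 grid probabilistically; the helper name and grid constant are illustrative.

```python
# Stochastic rounding onto the E2M1 grid: round up or down to a neighboring
# grid point with probability proportional to proximity, so the result is
# unbiased in expectation. Assumes values were already divided by the block's
# shared scale.
import numpy as np

E2M1_POINTS = np.array(
    [-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float64
)

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    x = np.clip(x, E2M1_POINTS[0], E2M1_POINTS[-1])
    hi_idx = np.clip(np.searchsorted(E2M1_POINTS, x), 1, len(E2M1_POINTS) - 1)
    lo, hi = E2M1_POINTS[hi_idx - 1], E2M1_POINTS[hi_idx]
    p_up = (x - lo) / (hi - lo)   # closer to hi -> more likely to round up
    return np.where(rng.random(x.shape) < p_up, hi, lo)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.full(100_000, 0.1)     # sits between grid points 0.0 and 0.5
    print("mean after rounding:", stochastic_round(x, rng).mean())
```

Nearest rounding would collapse every 0.1 to 0.0 and the signal would vanish; stochastic rounding keeps the mean near 0.1, and that unbiasedness is what lets tiny gradient updates accumulate instead of being rounded away.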
MXFP4 in Action: OpenAI’s GPT‑OSS Models
To prove that MXFP4 wasn’t just theory, OpenAI released the GPT‑OSS family: open-weight models trained natively with MXFP4.
These models demonstrate:
- Massive compression: The 120B-parameter model fits in 80 GB of VRAM, and the 20B model fits in just 16 GB (a back-of-envelope check follows this list).
- No quality compromise: Reasoning and coding benchmarks near parity with heavyweights trained in much higher precision.
- Open access: Everything available under Apache 2.0, ready for production or research use.
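The compression claim is easy to sanity-check with rough arithmetic. The sketch below assumes every parameter is stored at 4 bits plus one shared 8-bit scale per 32-element block; real checkpoints keep some tensors (embeddings, norms, attention) in higher precision, so actual footprints are somewhat larger.

```python
# Rough MXFP4 footprint: 4 bits per element plus an amortized 8-bit scale
# per 32-element block (0.25 extra bits per parameter). Illustrative only.
GIB = 1024 ** 3

def mxfp4_gib(num_params: float) -> float:
    bits_per_param = 4 + 8 / 32
    return num_params * bits_per_param / 8 / GIB

for name, n in [("GPT-OSS-120B", 120e9), ("GPT-OSS-20B", 20e9)]:
    print(f"{name}: ~{mxfp4_gib(n):.0f} GiB in MXFP4 vs ~{n * 2 / GIB:.0f} GiB in BF16")
```

That works out to roughly 60 GiB and 10 GiB of weights respectively, which is why a single 80 GB H100 and a 16 GB consumer card are enough.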
Ecosystem Support: Beyond Proprietary Solutions
MXFP4 is a true open standard, not a vendor lock-in trick:
- NVIDIA Blackwell: Native MXFP4 hardware support, with double the FP8 throughput.
- NVIDIA Hopper (H100): Software-optimized support through Triton.
- Broad adoption: Hugging Face Transformers, vLLM, NVIDIA NIM, Ollama, and more (a minimal loading sketch follows).
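As a quick taste of that ecosystem support, here is a minimal sketch of loading the published openai/gpt-oss-20b checkpoint through Hugging Face Transformers. It assumes a recent transformers release with GPT‑OSS support and enough GPU memory; exact flags and quantization behavior can vary between library versions and hardware.

```python
# Minimal sketch: load and sample from the MXFP4-quantized GPT-OSS-20B
# checkpoint via Hugging Face Transformers. Assumes a recent transformers
# release with GPT-OSS support; details may differ across versions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision where supported
    device_map="auto",    # place layers on available GPUs automatically
)

inputs = tokenizer("Explain MXFP4 in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```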
Final Thoughts
MXFP4 isn’t just “smaller numbers”: it’s the bridge between impossible and possible in AI. By packing more intelligence into fewer bits, and by letting far more people train and deploy powerful models, MXFP4 signals that the democratization of AI is truly here.