DFloat11 Compressed Model: black-forest-labs/FLUX.1-Kontext-dev

This is a DFloat11 losslessly compressed version of the original black-forest-labs/FLUX.1-Kontext-dev model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, FLUX.1-Kontext-dev can now run smoothly on a single 24GB GPU without any quality loss. 🔥🔥🔥

📊 Performance Comparison

| Metric | FLUX.1-Kontext-dev (BFloat16) | FLUX.1-Kontext-dev (DFloat11) |
| --- | --- | --- |
| Model Size | 23.80 GB | 16.33 GB |
| Peak GPU Memory (1024×1024 image generation) | 24.86 GB | 18.12 GB |
| Generation Time (A100 GPU) | 72 seconds | 83 seconds |

🔧 How to Use

  1. Install or upgrade the DFloat11 pip package (this installs the CUDA kernel automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. Install diffusers from the main branch, as FluxKontextPipeline support has not yet reached a stable release:

    pip install git+https://github.com/huggingface/diffusers.git
    
  3. To use the DFloat11 model, run the following Python example:

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image
    from dfloat11 import DFloat11Model

    # Load the original pipeline in BFloat16; the transformer weights are replaced below
    pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)

    # Swap in the DFloat11 compressed transformer weights; they are decompressed
    # on the fly on the GPU during inference, so outputs stay bit-identical
    DFloat11Model.from_pretrained(
        "DFloat11/FLUX.1-Kontext-dev-DF11",
        device="cpu",
        bfloat16_model=pipe.transformer,
    )

    # Offload idle pipeline components to CPU to reduce peak GPU memory
    pipe.enable_model_cpu_offload()

    input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

    # Edit the input image according to the text prompt
    image = pipe(
        image=input_image,
        prompt="Add a hat to the cat",
        guidance_scale=2.5,
    ).images[0]

    image.save("kontext.png")
    

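To reproduce the numbers in the table above on your own hardware, you can wrap the generation call in PyTorch's timing and peak-memory utilities. A minimal sketch, assuming `pipe` and `input_image` are set up as in the example above:

    import time
    import torch

    # Reset the peak-memory counter so it reflects only this generation
    torch.cuda.reset_peak_memory_stats()

    start = time.time()
    image = pipe(
        image=input_image,
        prompt="Add a hat to the cat",
        guidance_scale=2.5,
    ).images[0]
    elapsed = time.time() - start

    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Generation time: {elapsed:.1f} s | Peak GPU memory: {peak_gb:.2f} GB")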
🔍 How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
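For intuition, you can check the compressibility claim yourself by reinterpreting a BFloat16 tensor's raw bits and measuring the empirical entropy of the exponent field. A minimal sketch; the random tensor here is only a stand-in for trained model weights (whose exponents carry the ~2.6 bits cited above):

    import torch

    # Any BFloat16 tensor works; random-normal values are a rough stand-in
    # for trained weights (expect an entropy well under 8 bits)
    w = torch.randn(1_000_000, dtype=torch.bfloat16)

    # BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
    # Reinterpret the raw 16-bit pattern and extract the exponent field.
    bits = w.view(torch.int16).long() & 0xFFFF
    exponent = (bits >> 7) & 0xFF

    # Empirical (Shannon) entropy of the exponent field, in bits per value
    p = torch.bincount(exponent, minlength=256).float()
    p = p[p > 0] / p.sum()
    entropy = -(p * p.log2()).sum().item()
    print(f"exponent entropy: {entropy:.2f} bits out of 8")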

The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Learn more in our research paper.
