Improved Image Tokenizer

This is an improved image tokenizer for NextStep-1. The encoder is frozen and only the decoder is fine-tuned. The decoder refinement improves performance while preserving robust reconstruction quality. We recommend this image tokenizer for best results with NextStep-1 models.

Usage

import torch
from PIL import Image
import numpy as np
import torchvision.transforms as transforms

from modeling_flux_vae import AutoencoderKL

device = "cuda"
dtype = torch.bfloat16

model_path = "/path/to/vae_dir"
vae = AutoencoderKL.from_pretrained(model_path).to(device=device, dtype=dtype)

pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)

# Convert to RGB so grayscale or RGBA inputs match the 3-channel Normalize above
image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

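# encode and decode without tracking gradients (inference only)
with torch.no_grad():
    # encode: sample a latent from the posterior distribution
    latents = vae.encode(pixel_values).latent_dist.sample()

    # decode: reconstruct pixel values in [-1, 1]
    sampled_images = vae.decode(latents).sample
sampled_images = sampled_images.detach().cpu().to(torch.float32)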

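def tensor_to_pil(tensor):
    # expects a single CHW image tensor with values in [-1, 1]
    image = tensor.detach().cpu().to(torch.float32)
    # undo Normalize(0.5, 0.5): map [-1, 1] back to [0, 1]
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    # CHW -> HWC, the layout PIL expects
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image, mode="RGB")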

rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")

Evaluation

Reconstruction Performance on ImageNet-1K 256×256

| Tokenizer | Latent Shape | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| Discrete Tokenizers | | | |
| SBER-MoVQGAN (270M) | 32×32 | 27.04 | 0.74 |
| LlamaGen | 32×32 | 24.44 | 0.77 |
| VAR | 680 | 22.12 | 0.62 |
| TiTok-S-128 | 128 | 17.52 | 0.44 |
| Selftok | 1024 | 26.30 | 0.81 |
| Continuous Tokenizers | | | |
| Stable Diffusion 1.5 | 32×32×4 | 25.18 | 0.73 |
| Stable Diffusion XL | 32×32×4 | 26.22 | 0.77 |
| Stable Diffusion 3 Medium | 32×32×16 | 30.00 | 0.88 |
| Flux.1-dev | 32×32×16 | 31.64 | 0.91 |
| NextStep-1 | 32×32×16 | 30.60 | 0.89 |
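
For readers who want to sanity-check a reconstruction of their own, the PSNR column can be approximated with the standard definition, 10·log10(MAX² / MSE). The sketch below is not the evaluation script behind the table: the `psnr` helper, the 8-bit data range of 255, and the synthetic test image are all assumptions made for illustration, and the exact protocol used for the reported numbers (resizing, averaging over images) is not stated here.

```python
import numpy as np


def psnr(reference: np.ndarray, reconstruction: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of identical shape.

    Hypothetical helper; assumes 8-bit images unless max_value is changed.
    """
    ref = reference.astype(np.float64)
    rec = reconstruction.astype(np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Synthetic stand-ins for an input image and its reconstruction (HWC, uint8).
original = np.full((256, 256, 3), 128, dtype=np.uint8)
reconstructed = original.copy()
reconstructed[..., 0] += 4  # shift the red channel by 4 intensity levels

print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

In practice, `original` and `reconstructed` would be `np.asarray(image)` and `np.asarray(rec_image)` from the usage snippet, provided both have the same size.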

Robustness of NextStep-1-f8ch16-Tokenizer

Impact of Noise Perturbation on Image Tokenizer Performance. The top panel displays quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.
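
A perturbation of this kind can be tried with a few lines of PyTorch. The sketch below assumes the noise is zero-mean Gaussian and is added directly to the latents before decoding; the caption does not say where the noise is injected or whether the latents are rescaled first, so treat both as assumptions. `perturb_latents` is a hypothetical helper, and the zero tensor stands in for a real latent of shape 1×16×32×32.

```python
import torch


def perturb_latents(latents: torch.Tensor, noise_std: float, seed: int = 0) -> torch.Tensor:
    """Return latents plus zero-mean Gaussian noise with standard deviation noise_std.

    Hypothetical helper; injecting noise in latent space is an assumption.
    """
    generator = torch.Generator(device="cpu").manual_seed(seed)
    # Sample on CPU in float32 for reproducibility, then match the latents' device and dtype.
    noise = torch.randn(latents.shape, generator=generator, dtype=torch.float32)
    return latents + noise_std * noise.to(device=latents.device, dtype=latents.dtype)


# Stand-in for `vae.encode(pixel_values).latent_dist.sample()` on one 256x256 image.
clean = torch.zeros(1, 16, 32, 32)
for std in (0.2, 0.5):
    noisy = perturb_latents(clean, std)
    print(f"requested std={std}, measured std={noisy.std().item():.3f}")
```

With a loaded model, `vae.decode(perturb_latents(latents, 0.2)).sample` would give the reconstruction from the noisy latents, which can then be compared with the clean reconstruction.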
