Improved Image Tokenizer

This is an improved image tokenizer for NextStep-1. Its decoder has been fine-tuned while the encoder is kept frozen, so the latent space produced by the encoder is unchanged. The refined decoder improves performance while preserving robust reconstruction quality. We recommend this image tokenizer for the best results with NextStep-1 models.

Usage

import torch
import torchvision.transforms as transforms
from PIL import Image

# Local module: modeling_flux_vae.py must be on the Python import path.
from modeling_flux_vae import AutoencoderKL

device = "cuda"
dtype = torch.bfloat16

model_path = "/path/to/vae_dir"
vae = AutoencoderKL.from_pretrained(model_path).to(device=device, dtype=dtype)
vae.eval()

# Map PIL images to tensors in [-1, 1], channels first.
pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)


def tensor_to_pil(tensor):
    """Convert a (3, H, W) tensor in [-1, 1] to an RGB PIL image."""
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    # A uint8 array of shape (H, W, 3) is interpreted as RGB.
    return Image.fromarray(image)


# Force three channels so grayscale or RGBA files match the Normalize call.
image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    # encode
    latents = vae.encode(pixel_values).latent_dist.sample()

    # decode
    sampled_images = vae.decode(latents).sample

rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
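
As a sanity check on the tensors above, it can help to know what latent shape to expect. The sketch below assumes that the "f8ch16" in the tokenizer name means 8× spatial downsampling and 16 latent channels, which is consistent with the 32×32×16 latent shape reported for 256×256 inputs in the evaluation table. The helper name `expected_latent_shape` and the channels-first layout are illustrative assumptions, not part of the released code.

```python
def expected_latent_shape(height, width, downsample=8, channels=16):
    """Return the assumed (channels, height, width) of the latent for one image.

    Assumption: 8x spatial downsampling and 16 latent channels, inferred
    from the "f8ch16" tokenizer name; the released model may differ.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if height % downsample or width % downsample:
        # With a strided encoder, other sizes may not reconstruct to the
        # original resolution, so this sketch rejects them.
        raise ValueError(
            f"height and width should be multiples of {downsample}, "
            f"got {height}x{width}"
        )
    return (channels, height // downsample, width // downsample)


print(expected_latent_shape(256, 256))  # (16, 32, 32)
```

Comparing this tuple with `latents.shape[1:]` is a quick way to confirm that an input was resized or cropped as intended before encoding.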

Evaluation

Reconstruction Performance on ImageNet-1K 256×256

Tokenizer	Latent Shape	PSNR ↑	SSIM ↑
Discrete Tokenizers
SBER-MoVQGAN (270M)	32×32	27.04	0.74
LlamaGen	32×32	24.44	0.77
VAR	680	22.12	0.62
TiTok-S-128	128	17.52	0.44
Selftok	1024	26.30	0.81
Continuous Tokenizers
Stable Diffusion 1.5	32×32×4	25.18	0.73
Stable Diffusion XL	32×32×4	26.22	0.77
Stable Diffusion 3 Medium	32×32×16	30.00	0.88
Flux.1-dev	32×32×16	31.64	0.91
NextStep-1	32×32×16	30.60	0.89
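
For readers who want to spot-check reconstructions against the PSNR column, here is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), on 8-bit images. The exact evaluation protocol behind the table (resizing, cropping, color handling) is not described here, so values from this sketch are not guaranteed to match the reported numbers. The function name `psnr` is illustrative.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {rec.shape}")
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Synthetic stand-in for an (H, W, 3) uint8 image and a noisy copy of it.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
noise = rng.normal(0.0, 5.0, size=original.shape)
noisy = np.clip(original + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(original, noisy):.2f} dB")
```

With the usage code above, `np.asarray(image)` and `np.asarray(rec_image)` can be passed in directly, provided both have the same size.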

Robustness of NextStep-1-f8ch16-Tokenizer

Impact of Noise Perturbation on Image Tokenizer Performance. The top panel displays quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.
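
A rough way to try a similar perturbation locally is sketched below. It assumes the noise is zero-mean Gaussian and is added to the latents before decoding; the caption does not state where the noise is applied or whether the latents are normalized first, so treat this as an illustration and not the evaluation procedure. The helper name `perturb_latents` is hypothetical.

```python
import torch


def perturb_latents(latents, noise_std):
    """Add zero-mean Gaussian noise with the given standard deviation.

    Assumption: the robustness test perturbs latents directly; this is
    not confirmed by the model card.
    """
    if noise_std < 0:
        raise ValueError("noise_std must be non-negative")
    return latents + noise_std * torch.randn_like(latents)


# Stand-in latents with the shape assumed for one 256x256 image.
torch.manual_seed(0)
latents = torch.zeros(1, 16, 32, 32)
for std in (0.2, 0.5):
    noisy = perturb_latents(latents, std)
    print(std, tuple(noisy.shape))
```

In the usage code, the perturbed tensor would be passed to `vae.decode(...)` in place of `latents`, and the decoded image compared with the unperturbed reconstruction.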
