---
license: apache-2.0
tags:
- NextStep
- Image Tokenizer
---
# Improved Image Tokenizer
This is an improved image tokenizer for NextStep-1, featuring a decoder fine-tuned against a frozen encoder. The decoder refinement **improves performance** while preserving robust reconstruction quality. We **recommend using this image tokenizer** for optimal results with NextStep-1 models.
## Usage
```py
import torch
from PIL import Image
import torchvision.transforms as transforms

from modeling_flux_vae import AutoencoderKL

device = "cuda"
dtype = torch.bfloat16
model_path = "/path/to/vae_dir"

vae = AutoencoderKL.from_pretrained(model_path).to(device=device, dtype=dtype)

# Map pixel values from [0, 1] to [-1, 1], the range the VAE expects.
pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)

image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

with torch.no_grad():
    # encode
    latents = vae.encode(pixel_values).latent_dist.sample()
    # decode
    sampled_images = vae.decode(latents).sample

def tensor_to_pil(tensor):
    # Undo the [-1, 1] normalization and convert CHW float to HWC uint8.
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image, mode="RGB")

rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
```
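The `Normalize` transform above maps pixels from [0, 1] to [-1, 1], and `tensor_to_pil` inverts that mapping with `x / 2 + 0.5`. A quick self-contained check of this round trip (pure NumPy, no model required):

```python
import numpy as np

pixels = np.linspace(0.0, 1.0, 5)      # values in [0, 1], as ToTensor produces
normalized = (pixels - 0.5) / 0.5      # what transforms.Normalize(0.5, 0.5) does
recovered = normalized / 2 + 0.5       # what tensor_to_pil does before scaling to uint8
assert np.allclose(recovered, pixels)  # the two steps are exact inverses
```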
## Evaluation
### Reconstruction Performance on ImageNet-1K 256×256
| Tokenizer | Latent Shape | PSNR ↑ | SSIM ↑ |
| ------------------------- | ------------ | --------- | -------- |
| **Discrete Tokenizers** | | | |
| SBER-MoVQGAN (270M) | 32×32 | 27.04 | 0.74 |
| LlamaGen | 32×32 | 24.44 | 0.77 |
| VAR | 680 | 22.12 | 0.62 |
| TiTok-S-128 | 128 | 17.52 | 0.44 |
| Selftok                   | 1024         | 26.30     | 0.81     |
| **Continuous Tokenizers** | | | |
| Stable Diffusion 1.5 | 32×32×4 | 25.18 | 0.73 |
| Stable Diffusion XL | 32×32×4 | 26.22 | 0.77 |
| Stable Diffusion 3 Medium | 32×32×16 | 30.00 | 0.88 |
| Flux.1-dev | 32×32×16 | 31.64 | 0.91 |
| **NextStep-1** | **32×32×16** | **30.60** | **0.89** |
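PSNR in the table is the standard peak signal-to-noise ratio over 8-bit images. The exact evaluation pipeline is not specified here, so the following is only an illustrative sketch of the metric itself:

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE). Higher is better."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((8, 8), 128, dtype=np.uint8)
b = a.copy()
b[0, 0] = 129            # one pixel off by one grey level
print(psnr(a, b))        # large dB value: nearly identical images
```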
### Robustness of NextStep-1-f8ch16-Tokenizer
Impact of noise perturbation on image tokenizer performance. The top panel shows quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity; the bottom panel shows qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.
<div align='center'>
<img src="assets/robustness.png" class="interpolation-image" alt="arch." width="100%" />
</div>
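The perturbation protocol can be illustrated with a toy stand-in: add Gaussian noise of increasing standard deviation to a tensor in the VAE's [-1, 1] range and watch the reconstruction error grow. The real evaluation perturbs the tokenizer's latents and decodes them; the array below is only a random placeholder for such a latent:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a 16-channel, 32x32 latent tensor in roughly unit range.
clean = rng.uniform(-1.0, 1.0, size=(16, 32, 32))

for sigma in (0.2, 0.5):  # the noise levels shown in the figure
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    mse = np.mean((noisy - clean) ** 2)
    # MSE of additive N(0, sigma) noise concentrates around sigma**2.
    print(f"sigma={sigma}: MSE ~ {mse:.3f}")
```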