---
license: apache-2.0
tags:
- NextStep
- Image Tokenizer
---
# Improved Image Tokenizer

This is an improved image tokenizer for NextStep-1, featuring a decoder that was fine-tuned while the encoder stayed frozen. The decoder refinement **improves performance** while preserving robust reconstruction quality. We **recommend using this Image Tokenizer** for optimal results with NextStep-1 models.
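
As a rough illustration of this training setup (not the authors' training code), a decoder-only refinement loop freezes the encoder parameters and optimizes only the decoder against a reconstruction loss. The `encoder`/`decoder` submodule names, optimizer settings, and loss below are assumptions:

```py
import torch
import torch.nn.functional as F

from modeling_flux_vae import AutoencoderKL

# Hypothetical sketch: freeze the encoder, fine-tune only the decoder.
vae = AutoencoderKL.from_pretrained("/path/to/vae_dir")

for p in vae.encoder.parameters():  # assumes encoder/decoder submodules are exposed
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(vae.decoder.parameters(), lr=1e-5)  # placeholder hyperparameters

def train_step(pixel_values):
    # Encode without gradients, then reconstruct with the trainable decoder.
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
    recon = vae.decode(latents).sample
    loss = F.mse_loss(recon, pixel_values)  # simple pixel-space proxy for the real objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```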

## Usage

```py
import torch
from PIL import Image
import numpy as np
import torchvision.transforms as transforms

from modeling_flux_vae import AutoencoderKL

device = "cuda"
dtype = torch.bfloat16

model_path = "/path/to/vae_dir"
vae = AutoencoderKL.from_pretrained(model_path).to(device=device, dtype=dtype)

pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)

image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

# encode
latents = vae.encode(pixel_values).latent_dist.sample()

# decode
sampled_images = vae.decode(latents).sample
sampled_images = sampled_images.detach().cpu().to(torch.float32)

def tensor_to_pil(tensor):
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image, mode="RGB")

rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
```

## Evaluation

### Reconstruction Performance on ImageNet-1K 256×256

| Tokenizer                 | Latent Shape | PSNR ↑    | SSIM ↑   |
| ------------------------- | ------------ | --------- | -------- |
| **Discrete Tokenizers**   |              |           |          |
| SBER-MoVQGAN (270M)       | 32×32        | 27.04     | 0.74     |
| LlamaGen                  | 32×32        | 24.44     | 0.77     |
| VAR                       | 680          | 22.12     | 0.62     |
| TiTok-S-128               | 128          | 17.52     | 0.44     |
| Selftok                   | 1024         | 26.30     | 0.81     |
| **Continuous Tokenizers** |              |           |          |
| Stable Diffusion 1.5      | 32×32×4      | 25.18     | 0.73     |
| Stable Diffusion XL       | 32×32×4      | 26.22     | 0.77     |
| Stable Diffusion 3 Medium | 32×32×16     | 30.00     | 0.88     |
| Flux.1-dev                | 32×32×16     | 31.64     | 0.91     |
| **NextStep-1**            | **32×32×16** | **30.60** | **0.89** |
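
For reference, reconstruction metrics such as the PSNR and SSIM values above are computed per image and averaged over the validation set. A minimal sketch using `torchmetrics` (the metric library and the exact preprocessing are assumptions, not the official evaluation protocol):

```py
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

# Hypothetical evaluation sketch: compare original and reconstructed images in [0, 1].
psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

def reconstruction_metrics(original, reconstructed):
    """original, reconstructed: (N, 3, H, W) tensors in [-1, 1], as produced by the VAE."""
    original = (original / 2 + 0.5).clamp(0, 1)
    reconstructed = (reconstructed / 2 + 0.5).clamp(0, 1)
    return psnr(reconstructed, original).item(), ssim(reconstructed, original).item()
```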

### Robustness of NextStep-1-f8ch16-Tokenizer

Impact of Noise Perturbation on Image Tokenizer Performance. The top panel displays quantitative metrics (rFID ↓, PSNR ↑, and SSIM ↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.

<div align='center'>
<img src="assets/robustness.png" class="interpolation-image" alt="robustness." width="100%" />
</div>
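
As a rough sketch of how such a robustness check can be reproduced (the Gaussian noise model and the standard deviations are assumptions inferred from the figure above), one can perturb the latents before decoding:

```py
import torch

# Hypothetical robustness check: add Gaussian noise to the latents before decoding.
# Assumes `vae`, `pixel_values`, and `tensor_to_pil` from the Usage snippet above.
latents = vae.encode(pixel_values).latent_dist.sample()

for noise_std in (0.2, 0.5):  # standard deviations shown in the figure
    noisy_latents = latents + noise_std * torch.randn_like(latents)
    recon = vae.decode(noisy_latents).sample
    tensor_to_pil(recon.detach().cpu().to(torch.float32)[0]).save(f"recon_noise_{noise_std}.jpg")
```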