Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

This is the official implementation of DiGIT (GitHub), accepted at NeurIPS 2024. The code will be available soon.

Overview

We present DiGIT, an auto-regressive generative model that performs next-token prediction in an abstract latent space derived from self-supervised learning (SSL) models. By applying K-Means clustering to the hidden states of the DINOv2 model, we create a novel discrete tokenizer. This method significantly boosts image generation performance on the ImageNet dataset, achieving an FID score of 4.59 for class-unconditional generation and 3.39 for class-conditional generation. The model also improves image understanding, attaining a linear-probe top-1 accuracy of 80.3%.
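
To make the tokenizer idea concrete, here is a minimal sketch of how K-Means centroids can turn continuous patch features into discrete token ids by nearest-centroid lookup. It is an illustration only, not the released implementation: the function name `assign_tokens`, the toy 4-dimensional features, and the 8-entry codebook are all assumptions, and real DINOv2 features and codebooks are far larger.

```python
import numpy as np

def assign_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each feature vector (N, D) to the index of its nearest centroid (K, D)."""
    # Squared Euclidean distance expanded as ||x||^2 - 2 x.c + ||c||^2
    x2 = (features ** 2).sum(axis=1, keepdims=True)   # shape (N, 1)
    c2 = (centroids ** 2).sum(axis=1)[None, :]        # shape (1, K)
    d2 = x2 - 2.0 * features @ centroids.T + c2       # shape (N, K)
    return d2.argmin(axis=1)                          # shape (N,)

# Toy codebook (hypothetical): centroid k is the vector [k, k, k, k].
centroids = np.arange(8, dtype=np.float64)[:, None] * np.ones((1, 4))

# Toy "patch features": slightly perturbed copies of centroids 3, 0 and 5.
patches = centroids[[3, 0, 5]] + 0.1

tokens = assign_tokens(patches, centroids)
print(tokens.tolist())  # [3, 0, 5]
```

The resulting integer sequence is what an auto-regressive model would consume for next-token prediction.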

Experimental Results

Linear-Probe Accuracy on ImageNet

| Methods | # Tokens | Features | # Params | Top-1 Acc. $\uparrow$ |
| --- | --- | --- | --- | --- |
| iGPT-L | 32 $\times$ 32 | 1536 | 1362M | 60.3 |
| iGPT-XL | 64 $\times$ 64 | 3072 | 6801M | 68.7 |
| VIM+VQGAN | 32 $\times$ 32 | 1024 | 650M | 61.8 |
| VIM+dVAE | 32 $\times$ 32 | 1024 | 650M | 63.8 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 1024 | 650M | 65.1 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 2048 | 1697M | 73.2 |
| AIM | 16 $\times$ 16 | 1536 | 0.6B | 70.5 |
| DiGIT (Ours) | 16 $\times$ 16 | 1024 | 219M | 71.7 |
| DiGIT (Ours) | 16 $\times$ 16 | 1536 | 732M | 80.3 |

Class-Unconditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

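| Type | Methods | # Param | # Epoch | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| GAN | BigGAN | 70M | - | 38.6 | 24.70 |
| Diff. | LDM | 395M | - | 39.1 | 22.83 |
| Diff. | ADM | 554M | - | 26.2 | 39.70 |
| MIM | MAGE | 200M | 1600 | 11.1 | 81.17 |
| MIM | MAGE | 463M | 1600 | 9.10 | 105.1 |
| MIM | MaskGIT | 227M | 300 | 20.7 | 42.08 |
| MIM | DiGIT (+MaskGIT) | 219M | 200 | 9.04 | 75.04 |
| AR | VQGAN | 214M | 200 | 24.38 | 30.93 |
| AR | DiGIT (+VQGAN) | 219M | 400 | 9.13 | 73.85 |
| AR | DiGIT (+VQGAN) | 732M | 200 | 4.59 | 141.29 |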

Class-Conditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

| Type | Methods | # Param | # Epoch | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| GAN | BigGAN | 160M | - | 6.95 | 198.2 |
| Diff. | ADM | 554M | - | 10.94 | 101.0 |
| Diff. | LDM-4 | 400M | - | 10.56 | 103.5 |
| Diff. | DiT-XL/2 | 675M | - | 9.62 | 121.50 |
| Diff. | L-DiT-7B | 7B | - | 6.09 | 153.32 |
| MIM | CQR-Trans | 371M | 300 | 5.45 | 172.6 |
| MIM+AR | VAR | 310M | 200 | 4.64 | - |
| MIM+AR | VAR | 310M | 200 | 3.60\* | 257.5\* |
| MIM+AR | VAR | 600M | 250 | 2.95\* | 306.1\* |
| MIM | MAGVIT-v2 | 307M | 1080 | 3.65 | 200.5 |
| AR | VQVAE-2 | 13.5B | - | 31.11 | 45 |
| AR | RQ-Trans | 480M | - | 15.72 | 86.8 |
| AR | RQ-Trans | 3.8B | - | 7.55 | 134.0 |
| AR | ViTVQGAN | 650M | 360 | 11.20 | 97.2 |
| AR | ViTVQGAN | 1.7B | 360 | 5.3 | 149.9 |
| MIM | MaskGIT | 227M | 300 | 6.18 | 182.1 |
| MIM | DiGIT (+MaskGIT) | 219M | 200 | 4.62 | 146.19 |
| AR | VQGAN | 227M | 300 | 18.65 | 80.4 |
| AR | DiGIT (+VQGAN) | 219M | 400 | 4.79 | 142.87 |
| AR | DiGIT (+VQGAN) | 732M | 200 | 3.39 | 205.96 |

*: VAR is trained with classifier-free guidance while all the other models are not.

Checkpoints

The K-Means npy file and model checkpoints can be downloaded from:

| Model | Link |
| --- | --- |
| HF weights 🤗 | Huggingface |

For the base-size model we use DINOv2-base, and for the large-size model we use DINOv2-large. The VQGAN we use is the same as the one used in MAGE. The files are expected to be laid out as follows:

DiGIT
└── data/
    ├── ILSVRC2012
        ├── dinov2_base_short_224_l3
            ├── km_8k.npy
        ├── dinov2_large_short_224_l3
            ├── km_16k.npy
└── outputs/
    ├── base_8k_stage1
    ├── ...
└── models/
    ├── vqgan_jax_strongaug.ckpt
    ├── dinov2_vitb14_reg4_pretrain.pth
    ├── dinov2_vitl14_reg4_pretrain.pth
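
Before launching training or inference, it can help to confirm that the downloaded files sit where the layout above expects them. The following is a small sketch under stated assumptions: the helper `missing_files` and the `EXPECTED` list are hypothetical names introduced here, the relative paths are read off the tree above, and the root directory is whatever local folder you placed the files in.

```python
from pathlib import Path

# Relative paths taken from the directory layout above.
EXPECTED = [
    "data/ILSVRC2012/dinov2_base_short_224_l3/km_8k.npy",
    "data/ILSVRC2012/dinov2_large_short_224_l3/km_16k.npy",
    "models/vqgan_jax_strongaug.ckpt",
    "models/dinov2_vitb14_reg4_pretrain.pth",
    "models/dinov2_vitl14_reg4_pretrain.pth",
]

def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

if __name__ == "__main__":
    # "DiGIT" is an assumed local checkout directory; adjust as needed.
    for rel in missing_files("DiGIT"):
        print(f"missing: {rel}")
```

An empty result means every checkpoint and K-Means file listed in the tree was found; the `outputs/` directory is not checked because it is produced by training.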

The training and inference code can be found in our GitHub repo: https://github.com/DAMO-NLP-SG/DiGIT

Citation

If you find our project useful, please star our repo and cite our work as follows.

@misc{zhu2024stabilize,
    title={Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective},
    author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
    year={2024},
    eprint={2410.12490},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}