---
base_model:
- google/siglip2-so400m-patch16-384
- google/siglip2-so400m-patch16-256
language:
- en
license: other
license_name: other
license_link: https://github.com/TencentARC/TokLIP/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- Tokenizer
- CLIP
- UnifiedMLLM
---
# TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
<h5 align="center">
[arXiv](https://arxiv.org/abs/2505.05422) | [Code](https://github.com/TencentARC/TokLIP) | [Models](https://huggingface.co/TencentARC/TokLIP) | [License](https://github.com/TencentARC/TokLIP/blob/main/LICENSE)
<br>
</h5>
Welcome to the official code repository for "[**TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation**](https://arxiv.org/abs/2505.05422)".
Your star means a lot to us in developing this project! ⭐⭐⭐
## 📰 News
* [2025/08/18] 🚀 Check our latest results on arXiv ([PDF](https://arxiv.org/pdf/2505.05422))!
* [2025/08/18] 🔥 We release TokLIP XL with 512 resolution [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt)!
* [2025/08/05] 🔥 We release the training code!
* [2025/06/05] 🔥 We release the code and models!
* [2025/05/09] 🚀 Our paper is available on arXiv!
## 👀 Introduction
<img src="https://raw.githubusercontent.com/TencentARC/TokLIP/main/docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />
- We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics** while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
- TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics (see the sketch below).
- Unlike previous approaches (e.g., VILA-U) that *discretize high-level features*, TokLIP **disentangles training objectives for comprehension and generation**, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations.
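As a rough illustration of this two-branch design, the sketch below separates the generation path (discrete VQ codes) from the comprehension path (CLIP-level semantics). All class and method names here are hypothetical placeholders and do not mirror the actual TokLIP implementation.

```python
import torch
import torch.nn as nn


class TokLIPSketch(nn.Module):
    """Conceptual sketch only: a low-level VQ tokenizer produces standard
    discrete codes (for autoregressive generation), while a ViT-based token
    encoder semanticizes those codes into CLIP-level features (for comprehension)."""

    def __init__(self, vq_tokenizer: nn.Module, token_encoder: nn.Module):
        super().__init__()
        self.vq_tokenizer = vq_tokenizer    # e.g. a LlamaGen or IBQ VQGAN
        self.token_encoder = token_encoder  # ViT operating on VQ token embeddings

    def forward(self, images: torch.Tensor):
        # Low-level branch: discrete VQ token ids, usable directly by an
        # autoregressive generator without tailored quantization operations.
        vq_ids = self.vq_tokenizer(images)

        # High-level branch: continuous, CLIP-aligned semantics for comprehension.
        semantic_features = self.token_encoder(vq_ids)
        return vq_ids, semantic_features
```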
## 🔧 Installation
```bash
conda create -n toklip python=3.10 -y
conda activate toklip
git clone https://github.com/TencentARC/TokLIP
pip install --upgrade pip
pip install -r requirements.txt
```
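After installation, a quick import check such as the one below can confirm that the environment is usable. This assumes `torch` and `open_clip_torch` are installed via `requirements.txt`; adjust the imports to match your pinned packages.

```python
# Sanity check for the toklip environment (assumes torch and open_clip_torch
# are pulled in by requirements.txt).
import torch
import open_clip

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("open_clip:", open_clip.__version__)
```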
## ⚙️ Usage
### Model Weight
| Model | Resolution | VQGAN | ImageNet Top-1 | COCO TR@1 | COCO IR@1 | Weight |
| :-------: | :--------: | :----------------------------------------------------------: | :-----: | :-------: | :-------: | :----------------------------------------------------------: |
| TokLIP-S | 256 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
| TokLIP-L | 384 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |
| TokLIP-XL | 512 | [IBQ](https://huggingface.co/TencentARC/IBQ-Tokenizer-262144/blob/main/imagenet256_262144.ckpt) | 80.8 | 69.40 | 53.77 | [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt) |
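The checkpoints can also be fetched programmatically, for example with `huggingface_hub` (a minimal sketch; choose the filename that matches the model you need):

```python
from huggingface_hub import hf_hub_download

# Download a TokLIP checkpoint from the TencentARC/TokLIP model repository.
ckpt_path = hf_hub_download(
    repo_id="TencentARC/TokLIP",
    filename="TokLIP_L_384.pt",  # or TokLIP_S_256.pt / TokLIP_XL_512.pt
)
print("Checkpoint saved to:", ckpt_path)
```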
### Training
1. Please refer to [img2dataset](https://github.com/rom1504/img2dataset) to prepare the WebDataset required for training. You may choose datasets such as **CC3M**, **CC12M**, or **LAION**.
2. Prepare the teacher models using `src/covert.py`:
```bash
cd src
TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
```
3. Train TokLIP using the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`. Set the `--train-data` and `--train-num-samples` arguments to match your prepared dataset.
### Evaluation
Please first download the TokLIP model weights.
We provide the evaluation scripts for ImageNet classification and MSCOCO Retrieval in `src/test_toklip_256.sh`, `src/test_toklip_384.sh`, and `src/test_toklip_512.sh`.
Please revise the `--pretrained`, `--imagenet-val`, and `--coco-dir` with your specific paths.
### Inference
We provide the inference example in `src/inference.py`.
```shell
cd src
python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrained 'YOUR_TOKLIP_PATH'
```
### Model Usage
We provide the `build_toklip_encoder` function in `src/create_toklip.py`. You can load TokLIP directly by passing the `model`, `image_size`, and `model_path` parameters.
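For example, a minimal loading sketch might look like the following. The exact signature and return value of `build_toklip_encoder` are defined in `src/create_toklip.py`; the config name and forward call below are illustrative assumptions based on the inference example above.

```python
import torch
from create_toklip import build_toklip_encoder  # run from within src/

# Load TokLIP-L at 384 resolution; model_path points to a downloaded checkpoint.
toklip = build_toklip_encoder(
    model="ViT-SO400M-16-SigLIP2-384-toklip",
    image_size=384,
    model_path="path/to/TokLIP_L_384.pt",
)

# Illustrative forward pass on a dummy batch; see src/inference.py for the
# actual preprocessing and usage.
with torch.no_grad():
    dummy = torch.randn(1, 3, 384, 384)
    outputs = toklip(dummy)
```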
## 🔜 TODOs
- [x] Release training codes.
- [x] Release TokLIP-XL with 512 resolution.
## 📂 Contact
If you have further questions, please open an issue or contact <[email protected]>.
Discussions and potential collaborations are also welcome.
## 🙏 Acknowledgement
This repo is built upon the following projects:
* [OpenCLIP](https://github.com/mlfoundations/open_clip)
* [LlamaGen](https://github.com/FoundationVision/LlamaGen)
* [DeCLIP](https://github.com/Sense-GVT/DeCLIP)
* [SEED-Voken](https://github.com/TencentARC/SEED-Voken)
We thank the authors for their codes.
## 📝 Citation
Please cite our work if you use our code or discuss our findings in your own research:
```bibtex
@article{lin2025toklip,
title={Toklip: Marry visual tokens to clip for multimodal comprehension and generation},
author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
journal={arXiv preprint arXiv:2505.05422},
year={2025}
}
```