Update README.md
README.md (CHANGED)
# TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

<h5 align="center">

[arXiv](https://arxiv.org/abs/2505.05422)
[GitHub](https://github.com/TencentARC/TokLIP)

</h5>

Welcome to the official code repository for "[**TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation**](https://arxiv.org/abs/2505.05422)".

Your star means a lot to us in developing this project! ⭐⭐⭐

## 📰 News

* [2025/08/18] 🚀 Check out our latest results on arXiv ([PDF](https://arxiv.org/pdf/2505.05422))!
* [2025/08/18] 🔥 We release TokLIP-XL with 512 resolution: [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt)!
* [2025/08/05] 🔥 We release the training code!
* [2025/06/05] 🔥 We release the code and models!
* [2025/05/09] 🚀 Our paper is available on arXiv!

## 👀 Introduction

<img src="./docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />

- We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics**, while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
### Model Weight

| Model | Resolution | VQGAN | IN Top1 | COCO TR@1 | COCO IR@1 | Weight |
| :---: | :--------: | :---: | :-----: | :-------: | :-------: | :----: |
| TokLIP-S | 256 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
| TokLIP-L | 384 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |
| TokLIP-XL | 512 | [IBQ](https://huggingface.co/TencentARC/IBQ-Tokenizer-262144/blob/main/imagenet256_262144.ckpt) | 80.8 | 69.36 | 53.79 | [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt) |
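The checkpoints above are hosted on the Hugging Face Hub, so a download sketch along the following lines should work. The `huggingface-cli` invocation and the local `./model` directory are assumptions, not the repository's documented procedure; any equivalent download method is fine.

```bash
# Sketch (assumed layout): fetch one checkpoint from the TencentARC/TokLIP repo on the Hub.
pip install -U "huggingface_hub[cli]"
huggingface-cli download TencentARC/TokLIP TokLIP_L_384.pt --local-dir ./model
```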
### Training

1. Please refer to [img2dataset](https://github.com/rom1504/img2dataset) to prepare the WebDataset required for training. You may choose datasets such as **CC3M**, **CC12M**, or **LAION**.

2. Prepare the teacher models using `src/covert.py`:

   ```bash
   cd src
   TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
   TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
   ```

3. Train TokLIP with the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`, setting the `--train-data` and `--train-num-samples` arguments accordingly (see the sketch after this list).
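To make steps 1 and 3 concrete, here is a hedged sketch: it builds WebDataset shards with img2dataset from a CC3M-style TSV and then launches the 256-resolution training script. The TSV file name, column names, shard pattern, sample count, and output paths are placeholders, and whether the script reads these values from edited variables or from command-line flags should be checked in `src/train_toklip_256.sh` itself.

```bash
# Step 1 (sketch): turn a CC3M-style (caption, url) TSV into WebDataset shards.
pip install img2dataset
img2dataset \
  --url_list cc3m_train.tsv --input_format "tsv" \
  --url_col "url" --caption_col "caption" \
  --output_format webdataset --output_folder ./data/cc3m-wds \
  --image_size 256 --processes_count 16 --thread_count 64

# Step 3 (sketch): point the training script at the shards, e.g. by setting inside the script:
#   --train-data "./data/cc3m-wds/{00000..00331}.tar"   # brace-expanded shard list (placeholder)
#   --train-num-samples 2800000                         # approximate number of downloaded pairs
cd src
bash train_toklip_256.sh
```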
### Evaluation

Please first download the TokLIP model weights.

We provide evaluation scripts for ImageNet classification and MSCOCO retrieval in `src/test_toklip_256.sh`, `src/test_toklip_384.sh`, and `src/test_toklip_512.sh`.

Please replace `--pretrained`, `--imagenet-val`, and `--coco-dir` with your specific paths.
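As an illustration, the three flags might be adjusted along these lines before launching an evaluation run; all paths below are placeholders for your local checkpoint, ImageNet validation set, and COCO directory.

```bash
cd src
# Edit these flags inside test_toklip_384.sh before running (all paths are placeholders):
#   --pretrained   ../model/TokLIP_L_384.pt
#   --imagenet-val /path/to/imagenet/val
#   --coco-dir     /path/to/coco
bash test_toklip_384.sh
```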
### Model Usage

We provide the `build_toklip_encoder` function in `src/create_toklip.py`; you can directly load TokLIP with the `model`, `image_size`, and `model_path` parameters.
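A minimal loading sketch is shown below. It assumes `build_toklip_encoder` returns a PyTorch module that maps images to TokLIP features; the config string mirrors the SigLIP2-384 TokLIP config used elsewhere in the repository, while the checkpoint path, preprocessing, and output format are assumptions to verify against `src/create_toklip.py`.

```python
# Hypothetical usage sketch – the exact signature and return type of build_toklip_encoder
# should be checked in src/create_toklip.py; values below are placeholders.
import torch
from create_toklip import build_toklip_encoder  # run from inside the src/ directory

toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',  # assumed config name for the 384-resolution model
    image_size=384,
    model_path='./model/TokLIP_L_384.pt',      # downloaded TokLIP-L checkpoint (placeholder path)
)
toklip.eval()

# Dummy input; real usage should apply the preprocessing that matches the chosen config.
images = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    outputs = toklip(images)  # semantic (CLIP-level) features and/or VQ tokens, per the paper
```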
## 🔜 TODOs

- [x] Release training codes.
- [x] Release TokLIP-XL with 512 resolution.


## 📂 Contact
## 🙏 Acknowledgement

This repo is built upon the following projects:

* [OpenCLIP](https://github.com/mlfoundations/open_clip)
* [LlamaGen](https://github.com/FoundationVision/LlamaGen)
* [DeCLIP](https://github.com/Sense-GVT/DeCLIP)
* [SEED-Voken](https://github.com/TencentARC/SEED-Voken)

We thank the authors for their codes.
Please cite our work if you use our code or discuss our findings in your own research:

```bibtex
@article{lin2025toklip,
  title={{TokLIP}: Marry Visual Tokens to {CLIP} for Multimodal Comprehension and Generation},
  author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
  journal={arXiv preprint arXiv:2505.05422},
  year={2025}
}
```