Felix1023 committed

Commit cabb747 · verified · 1 Parent(s): 6eb89bd

Update README.md

Files changed (1): README.md (+31 -16)
README.md CHANGED
@@ -16,8 +16,7 @@ tags:
 
 # TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
 
-
-<h5 align="left">
+<h5 align="center">
 
 [![arXiv](https://img.shields.io/badge/TokLIP-2505.05422-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2505.05422)
 [![GitHub](https://img.shields.io/badge/GitHub-Code-green?logo=github)](https://github.com/TencentARC/TokLIP)
@@ -27,20 +26,22 @@ tags:
 
 </h5>
 
-
 Welcome to the official code repository for "[**TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation**](https://arxiv.org/abs/2505.05422)".
 
-Your star means a lot for us to develop this project! ⭐⭐⭐
+Your star means a lot to us in developing this project! ⭐⭐⭐
 
 
 ## 📰 News
+* [2025/08/18] 🚀 Check our latest results on arXiv ([PDF](https://arxiv.org/pdf/2505.05422))!
+* [2025/08/18] 🔥 We release TokLIP-XL with 512 resolution [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt)!
+* [2025/08/05] 🔥 We release the training code!
 * [2025/06/05] 🔥 We release the code and models!
 * [2025/05/09] 🚀 Our paper is available on arXiv!
 
 
 ## 👀 Introduction
 
-<img src="./TokLIP.png" alt="TokLIP" style="zoom:50%;" />
+<img src="./docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />
 
 - We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics** while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
 
@@ -63,18 +64,31 @@ pip install -r requirements.txt
 
 ### Model Weight
 
-| Model | Resolution | IN Top1 | COCO TR@1 | COCO IR@1 | Weight |
-| :------: | :--------: | :-----: | :-------: | :-------: | :----------------------------------------------------------: |
-| TokLIP-S | 256 | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
-| TokLIP-L | 384 | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |
+| Model | Resolution | VQGAN | IN Top1 | COCO TR@1 | COCO IR@1 | Weight |
+| :-------: | :--------: | :----------------------------------------------------------: | :-----: | :-------: | :-------: | :----------------------------------------------------------: |
+| TokLIP-S | 256 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
+| TokLIP-L | 384 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |
+| TokLIP-XL | 512 | [IBQ](https://huggingface.co/TencentARC/IBQ-Tokenizer-262144/blob/main/imagenet256_262144.ckpt) | 80.8 | 69.36 | 53.79 | [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt) |
+
+
+### Training
+
+1. Please refer to [img2dataset](https://github.com/rom1504/img2dataset) to prepare the WebDataset required for training. You may choose datasets such as **CC3M**, **CC12M**, or **LAION**.
+
+2. Prepare the teacher models using `src/covert.py`:
+```bash
+cd src
+TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
+TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
+```
+3. Train TokLIP with the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`. You need to set the `--train-data` and `--train-num-samples` arguments accordingly.
 
-We are current working on TokLIP-XL with 512x512 resolution and it will be released soon!
 
 ### Evaluation
 
 Please first download the TokLIP model weights.
 
-We provide the evalution scripts for ImageNet classification and MSCOCO Retrieval in `src\test_toklip_256.sh` and `src\test_toklip_384.sh`.
+We provide the evaluation scripts for ImageNet classification and MSCOCO Retrieval in `src/test_toklip_256.sh`, `src/test_toklip_384.sh`, and `src/test_toklip_512.sh`.
 
 Please revise the `--pretrained`, `--imagenet-val`, and `--coco-dir` with your specific paths.
 
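A note on step 1 of the new Training section above: the README points to img2dataset but does not show an invocation. The sketch below is an illustrative example only, not code from the TokLIP repo; the metadata filename, column names, output folder, and worker counts are placeholder assumptions to adapt to whichever of CC3M, CC12M, or LAION you download.

```python
# Illustrative sketch of Training step 1 (not part of the TokLIP repo).
# Assumes a TSV metadata file with "url" and "caption" columns; adjust the
# paths, column names, and worker counts for your dataset and machine.
from img2dataset import download

download(
    url_list="cc3m_metadata.tsv",     # placeholder metadata file
    input_format="tsv",
    url_col="url",                    # assumed column name
    caption_col="caption",            # assumed column name
    output_format="webdataset",       # produces .tar shards for WebDataset training
    output_folder="./data/cc3m_wds",  # placeholder output directory
    image_size=256,
    processes_count=8,
    thread_count=32,
)
```

The resulting `.tar` shards are what step 3's `--train-data` argument should point to, together with a matching `--train-num-samples`; the exact shard-pattern syntax presumably follows the OpenCLIP-style training scripts the repo builds on.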
@@ -89,12 +103,12 @@ python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrain
 
 ### Model Usage
 
-We provide `build_toklip_encoder` function in `src/create_toklip.py`, you could direct load TokLIP with `model`, `image_size`, and `model_path` parameters.
+We provide the `build_toklip_encoder` function in `src/create_toklip.py`; you can directly load TokLIP with the `model`, `image_size`, and `model_path` parameters.
 
 
 ## 🔜 TODOs
-- [ ] Release training codes.
-- [ ] Release TokLIP-XL with 512 resolution.
+- [x] Release training codes.
+- [x] Release TokLIP-XL with 512 resolution.
 
 
 ## 📂 Contact
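To make the Model Usage line in the hunk above concrete, here is a hedged sketch of calling `build_toklip_encoder`. The parameter names (`model`, `image_size`, `model_path`) come from the README, and the config string reuses the `--model-config` value visible in the inference command; the import path, return type, and forward call are assumptions to verify against `src/create_toklip.py`.

```python
# Hedged usage sketch; verify the real API in src/create_toklip.py.
import torch
from create_toklip import build_toklip_encoder  # assumes the script runs from src/

toklip = build_toklip_encoder(
    model="ViT-SO400M-16-SigLIP2-384-toklip",  # config name reused from the inference.py command
    image_size=384,
    model_path="./TokLIP_L_384.pt",            # placeholder path to the downloaded checkpoint
)
toklip.eval()

with torch.no_grad():
    images = torch.randn(1, 3, 384, 384)  # dummy batch; real inputs need the model's own preprocessing
    outputs = toklip(images)              # output structure depends on the repo's implementation
```

Whether the encoder returns a single feature tensor or separate semantic and VQ outputs is repo-specific, so inspect `outputs` rather than assuming a shape.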
@@ -104,11 +118,12 @@ Discussions and potential collaborations are also welcome.
 
 
 ## 🙏 Acknowledgement
-This repo is build upon the following projects:
+This repo is built upon the following projects:
 
 * [OpenCLIP](https://github.com/mlfoundations/open_clip)
 * [LlamaGen](https://github.com/FoundationVision/LlamaGen)
 * [DeCLIP](https://github.com/Sense-GVT/DeCLIP)
+* [SEED-Voken](https://github.com/TencentARC/SEED-Voken)
 
 We thank the authors for their codes.
 
@@ -117,7 +132,7 @@ We thank the authors for their codes.
 Please cite our work if you use our code or discuss our findings in your own research:
 ```bibtex
 @article{lin2025toklip,
-title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
+title={Toklip: Marry visual tokens to clip for multimodal comprehension and generation},
 author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
 journal={arXiv preprint arXiv:2505.05422},
 year={2025}
 