Felix1023 committed · Commit c5f6feb · verified · Parent(s): 7d977c9

Update README.md

Files changed (1): README.md (+125, −5)
---
license: other
license_name: other
license_link: https://github.com/TencentARC/TokLIP/blob/main/LICENSE
language:
- en
base_model:
- google/siglip2-so400m-patch16-384
- google/siglip2-so400m-patch16-256
tags:
- Tokenizer
- CLIP
- UnifiedMLLM
---

# TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

<h5 align="left">

[![arXiv](https://img.shields.io/badge/TokLIP-2505.05422-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2505.05422)
[![GitHub](https://img.shields.io/badge/GitHub-Code-green?logo=github)](https://github.com/TencentARC/TokLIP)
[![HuggingFace](https://img.shields.io/badge/🤗%20Model-Huggingface-yellow)](https://huggingface.co/TencentARC/TokLIP)
[![License](https://img.shields.io/badge/⚖️%20Code%20License-Other-blue)](https://github.com/TencentARC/TokLIP/blob/main/LICENSE)
<br>

</h5>

Welcome to the official code repository for "[**TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation**](https://arxiv.org/abs/2505.05422)".

Your star means a lot to us in developing this project! ⭐⭐⭐

## 📰 News
* [2025/06/05] 🔥 We release the code and models!
* [2025/05/09] 🚀 Our paper is available on arXiv!

## 👀 Introduction

<img src="./docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />

- We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics**, while enabling end-to-end multimodal autoregressive training with standard VQ tokens.

- TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics (see the conceptual sketch below).

- Unlike previous approaches (e.g., VILA-U) that *discretize high-level features*, TokLIP **disentangles the training objectives for comprehension and generation**, allowing advanced VQ tokenizers to be applied directly, without tailored quantization operations.
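
A minimal conceptual sketch of this two-branch design (module names and internals are illustrative stand-ins, not the actual TokLIP implementation):

```python
import torch
import torch.nn as nn

class TokLIPSketch(nn.Module):
    """Illustrative only: a low-level VQ tokenizer feeding a ViT-based
    token encoder, mirroring the description above."""

    def __init__(self, vq_tokenizer: nn.Module, token_encoder: nn.Module):
        super().__init__()
        self.vq_tokenizer = vq_tokenizer    # emits discrete VQ tokens (generation side)
        self.token_encoder = token_encoder  # ViT over tokens -> CLIP-level semantics

    def forward(self, images: torch.Tensor):
        # Standard VQ tokens keep TokLIP compatible with autoregressive generation ...
        tokens = self.vq_tokenizer(images)
        # ... while the continuous features carry the high-level semantics used
        # for comprehension, so the two objectives stay disentangled.
        semantics = self.token_encoder(tokens)
        return tokens, semantics
```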

## 🔧 Installation
```bash
conda create -n toklip python=3.10 -y
conda activate toklip
git clone https://github.com/TencentARC/TokLIP
cd TokLIP
pip install --upgrade pip
pip install -r requirements.txt
```

## ⚙️ Usage

### Model Weight

| Model | Resolution | IN Top-1 | COCO TR@1 | COCO IR@1 | Weight |
| :------: | :--------: | :-----: | :-------: | :-------: | :----------------------------------------------------------: |
| TokLIP-S | 256 | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
| TokLIP-L | 384 | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |

(IN Top-1: ImageNet top-1 accuracy; TR@1 / IR@1: MSCOCO text and image retrieval recall@1.)

We are currently working on TokLIP-XL at 512×512 resolution; it will be released soon!

### Evaluation

Please first download the TokLIP model weights.

We provide the evaluation scripts for ImageNet classification and MSCOCO retrieval in `src/test_toklip_256.sh` and `src/test_toklip_384.sh`.

Please set `--pretrained`, `--imagenet-val`, and `--coco-dir` to your own paths.
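
For example (a hypothetical run; all paths below are placeholders):

```shell
cd src
# Inside test_toklip_384.sh, point these flags at your local copies:
#   --pretrained   /path/to/TokLIP_L_384.pt
#   --imagenet-val /path/to/imagenet/val
#   --coco-dir     /path/to/coco
bash test_toklip_384.sh
```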

### Inference

We provide an inference example in `src/inference.py`.

```shell
cd src
python inference.py --model-config 'ViT-SO400M-16-SigLIP2-384-toklip' --pretrained 'YOUR_TOKLIP_PATH'
```

### Model Usage

We provide the `build_toklip_encoder` function in `src/create_toklip.py`; you can load TokLIP directly by passing the `model`, `image_size`, and `model_path` parameters.
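
A minimal loading sketch, assuming `build_toklip_encoder` accepts those three arguments as keywords (the checkpoint path is a placeholder, and the forward call is illustrative; see `src/inference.py` for exact usage):

```python
import torch
from create_toklip import build_toklip_encoder  # run from within src/

# The model config name matches the inference example above; the path is a placeholder.
toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',
    image_size=384,
    model_path='/path/to/TokLIP_L_384.pt',
)
toklip.eval()

with torch.no_grad():
    images = torch.randn(1, 3, 384, 384)  # dummy batch at the 384 resolution
    outputs = toklip(images)              # illustrative forward pass
```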

## 🔜 TODOs
- [ ] Release the training code.
- [ ] Release TokLIP-XL with 512 resolution.

## 📂 Contact
If you have further questions, please open an issue or contact <[email protected]>.

Discussions and potential collaborations are also welcome.

## 🙏 Acknowledgement
This repo is built upon the following projects:

* [OpenCLIP](https://github.com/mlfoundations/open_clip)
* [LlamaGen](https://github.com/FoundationVision/LlamaGen)
* [DeCLIP](https://github.com/Sense-GVT/DeCLIP)

We thank the authors for their code.

## 📝 Citation
Please cite our work if you use our code or discuss our findings in your own research:
```bibtex
@article{lin2025toklip,
  title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
  author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
  journal={arXiv preprint arXiv:2505.05422},
  year={2025}
}
```