---
license: apache-2.0
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: image-feature-extraction
---
# Model Card for CoMP-SigLIP-So400M

<!-- Provide a quick summary of what the model is/does. -->
This is a vision foundation model (VFM) that supports <b>native-resolution image inputs</b>, continually pre-trained from [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [https://github.com/SliMM-X/CoMP-MM](https://github.com/SliMM-X/CoMP-MM)
- **Paper:** [https://arxiv.org/abs/2503.18931](https://arxiv.org/abs/2503.18931)

## How to Get Started with the Model

Install the package from the [GitHub repo](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.
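
A typical from-source install (assuming the repo ships a standard `pyproject.toml`/`setup.py`; check the repo's own README for the authoritative steps) is to clone https://github.com/SliMM-X/CoMP-MM and run `pip install -e .` inside it.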

```python
import requests
import torch
from PIL import Image

from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.vision_encoder import CoMPSiglipVisionModel

model_path = "SliMM-X/CoMP-SigLIP-So400M"

# Load the vision encoder in bfloat16 on the GPU.
model = CoMPSiglipVisionModel.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# PIL cannot open a URL directly, so fetch the image bytes first.
url = "https://slimm-x.github.io/comp/figs/teaser.png"
image_input = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image_input, return_tensors="pt").to("cuda")

# The forward pass takes the pixel values plus image_grid_thw, the (t, h, w)
# patch grid that encodes each image's native resolution.
output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)
```
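
Because the encoder runs at native resolution, the number of patch tokens grows with the input size. A minimal sketch of that behavior, reusing `model` and `processor` from the snippet above (the blank test images and the exact grid values are illustrative only):

```python
from PIL import Image

# Two blank images at different resolutions (illustrative inputs only).
for size in [(384, 384), (1024, 384)]:
    img = Image.new("RGB", size)
    batch = processor(images=img, return_tensors="pt").to("cuda")
    # image_grid_thw is the (t, h, w) patch grid; h * w grows with resolution.
    print(size, batch.image_grid_thw.tolist())
```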

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931}
}
```