---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: slimm
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for CoMP-MM-1B

<!-- Provide a quick summary of what the model is/does. -->

CoMP-MM-1B is a large multimodal model (LMM) that supports **native image resolution inputs**. It is composed of [CoMP-SigLIP](https://huggingface.co/SliMM-X/CoMP-SigLIP-So400M) as the vision encoder and [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) as the language model.
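
To check how the two components are combined, you can load the checkpoint and print its module tree and configuration. This is only a minimal sketch: it uses the `from_pretrained` entry point shown later in this card, and the comments describe what the printed tree is expected to contain based on the composition above.

```python
# Minimal sketch: inspect the components of CoMP-MM-1B.
# The printed module tree is expected to show the CoMP-SigLIP vision encoder
# and the Qwen2.5-0.5B-Instruct language model described above.
from slimm.model.slimm import SliMMForConditionalGeneration

model = SliMMForConditionalGeneration.from_pretrained(
    "SliMM-X/CoMP-MM-1B", torch_dtype="auto"
)
print(model)         # module tree of the vision encoder and language model
print(model.config)  # configuration of the combined model
```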

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the package from the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.

```python
# this is very similar to qwen2-vl
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
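
The snippet above loads the image from a URL. Since the processing pipeline follows Qwen2-VL conventions, `process_vision_info` should also accept local files (via a `file://` URI) or an in-memory `PIL.Image`; the sketch below illustrates that variant, reusing the `model`, `processor`, and `process_vision_info` objects from the example above. The local path is a placeholder, and the exact set of accepted image sources is an assumption based on the upstream `qwen_vl_utils` behavior rather than a documented SliMM guarantee.

```python
# Minimal sketch: querying the model with a local image instead of a URL.
# Assumption: SliMM's process_vision_info mirrors qwen_vl_utils and accepts
# PIL images and "file://" URIs in addition to http(s) URLs.
from PIL import Image

local_image = Image.open("example.png")  # placeholder path

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": local_image},
            # Alternatively: {"type": "image", "image": "file:///abs/path/example.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
))
```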

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```