---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: slimm
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for CoMP-MM-1B

<!-- Provide a quick summary of what the model is/does. -->

CoMP-MM-1B is a large multimodal model (LMM) that supports **native image resolution inputs**. It is composed of [CoMP-SigLIP](https://huggingface.co/SliMM-X/CoMP-SigLIP-So400M) as the vision encoder and [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) as the language model.
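
To check how the two components are combined, you can load the checkpoint and print its module tree and configuration. This is only a minimal sketch: it uses the `from_pretrained` entry point shown later in this card, and the comments describe what the printed tree is expected to contain based on the composition above.

```python
# Minimal sketch: inspect the components of CoMP-MM-1B.
# The printed module tree is expected to show the CoMP-SigLIP vision encoder
# and the Qwen2.5-0.5B-Instruct language model described above.
from slimm.model.slimm import SliMMForConditionalGeneration

model = SliMMForConditionalGeneration.from_pretrained(
    "SliMM-X/CoMP-MM-1B", torch_dtype="auto"
)
print(model)         # module tree of the vision encoder and language model
print(model.config)  # configuration of the combined model
```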

## Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the package from the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.

```python
# this is very similar to qwen2-vl
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
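
The snippet above loads the image from a URL. Since the processing pipeline follows Qwen2-VL conventions, `process_vision_info` should also accept local files (via a `file://` URI) or an in-memory `PIL.Image`; the sketch below illustrates that variant, reusing the `model`, `processor`, and `process_vision_info` objects from the example above. The local path is a placeholder, and the exact set of accepted image sources is an assumption based on the upstream `qwen_vl_utils` behavior rather than a documented SliMM guarantee.

```python
# Minimal sketch: querying the model with a local image instead of a URL.
# Assumption: SliMM's process_vision_info mirrors qwen_vl_utils and accepts
# PIL images and "file://" URIs in addition to http(s) URLs.
from PIL import Image

local_image = Image.open("example.png")  # placeholder path

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": local_image},
            # Alternatively: {"type": "image", "image": "file:///abs/path/example.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
))
```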

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```