---
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: slimm
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for CoMP-MM-1B

CoMP-MM-1B is a large multimodal model (LMM) that supports **native image resolution inputs**. It is composed of [CoMP-SigLIP](https://huggingface.co/SliMM-X/CoMP-SigLIP-So400M) (vision encoder) and [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (language model).

## Model Sources

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the GitHub repository listed above (a minimal installation sketch is included at the end of this card), then use the code below to run inference with the model.

```python
# Usage closely mirrors Qwen2-VL.
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "SliMM-X/CoMP-MM-1B"

# Load the model and processor from the Hugging Face Hub.
model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda"
)
processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# A single-turn conversation with one image and a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://slimm-x.github.io/comp/figs/teaser.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```
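## Installation Sketch

A minimal sketch for the installation step referenced in "How to Get Started with the Model". The clone URL comes from the Model Sources section; the editable install via `pip install -e .` is an assumption about the repository layout, so defer to the repository's own README if it documents a different procedure.

```bash
# Hedged sketch: clone the SliMM/CoMP-MM repository and install it as a package.
# The `pip install -e .` step is an assumption, not confirmed by this card.
git clone https://github.com/SliMM-X/CoMP-MM.git
cd CoMP-MM
pip install -e .
```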