---
base_model:
- facebook/dinov2-large
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers
---

# Model Card for CoMP-DINOv2-Large

This is a Vision Foundation Model (VFM) that supports native image resolution inputs, continually pre-trained from [DINOv2](https://huggingface.co/facebook/dinov2-large).

## Model Sources

- **Repository:** https://github.com/SliMM-X/CoMP-MM
- **Paper:** https://arxiv.org/abs/2503.18931
- **Project Page:** https://slimm-x.github.io/comp

## How to Get Started with the Model

Install the GitHub repository above, then use the code below to get started with the model.

```python
import requests
import torch
from PIL import Image

from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.utils_vl import process_vision_info
from slimm.model.vision_encoder import CoMPDinov2Model

model_path = "SliMM-X/CoMP-DINOv2-Large"

# Load the vision encoder; w_merger=False returns raw visual features
# rather than merging them for a downstream language model.
model = CoMPDinov2Model.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# PIL cannot open a URL directly, so fetch the image over HTTP first.
url = "https://slimm-x.github.io/comp/figs/teaser.png"
image_input = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image_input, return_tensors="pt").to("cuda")

output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)
```

## Citation

**BibTeX:**

```bibtex
@article{comp2025,
  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2025},
  journal={arXiv preprint arXiv:2503.18931},
}
```
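## Optional: Pooling the Features

The getting-started snippet prints raw patch-level features. If you need a single embedding per image (e.g. for retrieval), mean pooling over the patch dimension is one common choice. The sketch below is illustrative and not part of the CoMP-MM repository: it assumes the model output (or its `last_hidden_state`, depending on the return type) can be viewed as a `(num_patches, hidden_dim)` tensor; check the repository for the exact output structure.

```python
import torch

# Hypothetical helper -- not part of the CoMP-MM repo.
# Assumes `feats` is a (num_patches, hidden_dim) tensor of patch features,
# e.g. the tensor printed as `output_feat` in the snippet above.
def mean_pool(feats: torch.Tensor) -> torch.Tensor:
    """Collapse patch features into one L2-normalized image embedding."""
    pooled = feats.float().mean(dim=0)   # (hidden_dim,)
    return pooled / pooled.norm()        # unit norm, convenient for cosine similarity
```

Unit-normalizing the pooled vector turns cosine similarity into a plain dot product, which is convenient for nearest-neighbor search over a collection of images.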