metadata

base_model:
  - facebook/dinov2-large
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers

Model Card for CoMP-DINOv2-Large

This is a vision foundation model (VFM) that accepts images at their native resolution. It is continually pre-trained from DINOv2.

Model Sources

Repository: https://github.com/SliMM-X/CoMP-MM
Paper: https://arxiv.org/abs/2503.18931
Project Page: https://slimm-x.github.io/comp

How to Get Started with the Model

Install the GitHub repository linked above, then use the code below to get started with the model.

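from io import BytesIO

import requests
import torch
from PIL import Image
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.vision_encoder import CoMPDinov2Model

model_path = "SliMM-X/CoMP-DINOv2-Large"

# Load the vision encoder on the GPU and cast it to bfloat16.
# w_merger=False is kept from the original example.
model = CoMPDinov2Model.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
).to(torch.bfloat16)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

# PIL cannot open a URL directly, so download the image bytes first.
image_url = "https://slimm-x.github.io/comp/figs/teaser.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image_input = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(
    images=image_input,
    return_tensors="pt",
)

# Move the inputs to the GPU and run a forward pass without tracking gradients.
inputs = inputs.to("cuda")
with torch.no_grad():
    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
print(output_feat)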
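
Because the model takes native-resolution inputs, the number of patch tokens varies from image to image. The sketch below shows one way to work out which rows of a packed token sequence belong to which image. It assumes that image_grid_thw holds one (t, h, w) patch-count triple per image, as in the Qwen2-VL processor convention, and that with w_merger=False the encoder returns one feature row per patch. Neither assumption is confirmed by this model card, and token_spans is a hypothetical helper, not part of the slimm package.

```python
from typing import List, Sequence, Tuple


def token_spans(grid_thw: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Return the (start, end) row offsets of each image in a packed token sequence.

    Assumes each entry of grid_thw is a (t, h, w) triple of patch counts and
    that the tokens of all images are concatenated in order.
    """
    spans = []
    start = 0
    for t, h, w in grid_thw:
        n_tokens = t * h * w  # one token per patch (no merging assumed)
        spans.append((start, start + n_tokens))
        start += n_tokens
    return spans


# Hypothetical grids: one image of 16x24 patches and one of 32x32 patches.
example_grid = [(1, 16, 24), (1, 32, 32)]
print(token_spans(example_grid))  # [(0, 384), (384, 1408)]
```

If the assumptions hold, slicing the output with output_feat[start:end] would select the features of a single image, and inputs.image_grid_thw.tolist() would give the grid in the form this helper expects.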

Citation

BibTeX:

@article{comp2025,
      title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models}, 
      author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
      year={2025},
      journal={arXiv preprint arXiv:2503.18931}, 
}