	---
	base_model:
	- facebook/dinov2-large
	license: apache-2.0
	pipeline_tag: image-feature-extraction
	library_name: transformers
	---

	# Model Card for CoMP-DINOv2-Large

	<!-- Provide a quick summary of what the model is/does. -->
	This is a vision foundation model (VFM) that supports <b>native image resolution inputs</b>. It is continually pre-trained from [DINOv2](https://huggingface.co/facebook/dinov2-large).
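
To make "native resolution" concrete, the sketch below estimates how many patch tokens an image yields when it keeps its own aspect ratio instead of being resized to a fixed square. It assumes the 14-pixel patch size of DINOv2 carries over to this model. The helper `patch_grid` and its rounding rule are hypothetical and are not taken from the CoMP repository.

```python
PATCH_SIZE = 14  # DINOv2 patch size; assumed to carry over to CoMP-DINOv2


def patch_grid(height: int, width: int, patch_size: int = PATCH_SIZE) -> tuple[int, int]:
    """Return the (rows, cols) patch grid for an image kept at its own aspect ratio.

    Each side is rounded to the nearest whole number of patches (at least one).
    This rounding rule is illustrative only, not the repository's preprocessing.
    """
    rows = max(1, round(height / patch_size))
    cols = max(1, round(width / patch_size))
    return rows, cols


for height, width in [(224, 224), (448, 672), (1080, 1920)]:
    rows, cols = patch_grid(height, width)
    print(f"{height}x{width} -> {rows}x{cols} grid, {rows * cols} tokens")
```

The token count grows with the image area, so large inputs cost more memory and compute than a fixed-size crop would.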

	## Model Sources

	<!-- Provide the basic links for the model. -->

	- Repository: https://github.com/SliMM-X/CoMP-MM
	- Paper: https://arxiv.org/abs/2503.18931
	- Project Page: https://slimm-x.github.io/comp

	## How to Get Started with the Model

	Install the [GitHub repository](https://github.com/SliMM-X/CoMP-MM), then use the code below to get started with the model.

	```python
	import requests
	import torch
	from PIL import Image

	from slimm.model.processor import SliMMQwen2VLProcessor
	from slimm.model.vision_encoder import CoMPDinov2Model

	model_path = "SliMM-X/CoMP-DINOv2-Large"

	# Load the vision encoder on the GPU and cast it to bfloat16.
	model = CoMPDinov2Model.from_pretrained(
	    model_path, torch_dtype="auto", device_map="cuda", w_merger=False
	).to(torch.bfloat16)

	processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

	# Image.open cannot read a URL directly, so fetch the image over HTTP first.
	url = "https://slimm-x.github.io/comp/figs/teaser.png"
	image_input = Image.open(requests.get(url, stream=True).raw).convert("RGB")
	inputs = processor(
	    images=image_input,
	    return_tensors="pt",
	)

	inputs = inputs.to("cuda")
	# Inference only, so skip gradient tracking.
	with torch.no_grad():
	    output_feat = model(inputs.pixel_values.to(torch.bfloat16), inputs.image_grid_thw)
	print(output_feat)
	```
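
A common next step is to reduce the patch features to one embedding per image. The sketch below shows one way to do this with mean pooling. It assumes the output is a flat tensor of shape `(total_tokens, hidden_dim)` in which image `i` contributes `t * h * w` consecutive tokens according to `image_grid_thw`. That layout is an assumption and has not been checked against the CoMP code. The helper `pool_per_image` is hypothetical, and random tensors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F


def pool_per_image(features: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
    """Mean-pool a flat token sequence into one L2-normalized vector per image.

    Assumes `features` has shape (total_tokens, hidden_dim) and that image i
    contributes t_i * h_i * w_i consecutive tokens (an assumed layout).
    """
    tokens_per_image = grid_thw.prod(dim=1).tolist()
    chunks = torch.split(features.float(), tokens_per_image, dim=0)
    pooled = torch.stack([chunk.mean(dim=0) for chunk in chunks])
    return F.normalize(pooled, dim=-1)


# Stand-in data: two images with 2x3 and 4x4 patch grids, hidden size 1024.
grid_thw = torch.tensor([[1, 2, 3], [1, 4, 4]])
features = torch.randn(6 + 16, 1024)

embeddings = pool_per_image(features, grid_thw)
# The rows are unit-length, so the dot product is the cosine similarity.
similarity = embeddings[0] @ embeddings[1]
print(embeddings.shape, float(similarity))
```

With real outputs, `features` would be `output_feat` and `grid_thw` would be `inputs.image_grid_thw`, provided the assumed layout holds.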

	## Citation

	BibTeX:

	```bibtex
	@article{comp2025,
	  title={CoMP: Continual Multimodal Pre-training for Vision Foundation Models},
	  author={Chen, Yitong and Meng, Lingchen and Peng, Wujian and Wu, Zuxuan and Jiang, Yu-Gang},
	  year={2025},
	  journal={arXiv preprint arXiv:2503.18931},
	}
	```