Model Card for MedDINOv3
MedDINOv3 is a medical vision foundation model pretrained on CT-3M, a collection of 2D axial CT slices covering diverse anatomical regions. MedDINOv3 produces high-quality dense features that achieve strong performance on various CT segmentation tasks, significantly surpassing previous supervised CNN and transformer models.
Model Details
Model Description
We provide a ViT-B/16 model pretrained on CT-3M using the three-stage DINOv3 objective.
- Developed by: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
- Model type: Vision Transformer
Model Sources
- Repository: https://github.com/ricklisz/MedDINOv3
- Paper: arXiv:2509.02379
Uses
The model is a vision backbone providing multi-purpose features for downstream medical imaging tasks.
Direct Use
- Use as a frozen feature extractor for medical imaging tasks (e.g., segmentation, classification); a minimal sketch follows this list.
- Fine-tune within nnU-Net or other medical segmentation frameworks.
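As an illustration of the frozen-feature-extractor use case, here is a minimal linear-probe sketch for slice-level classification. The `FrozenLinearProbe` wrapper and the assumption that calling the backbone returns a 768-dimensional global feature (the ViT-B embedding size) are ours, not part of the repository; verify the backbone's forward interface and expected input preprocessing against the repo before use.

```python
import torch
import torch.nn as nn

class FrozenLinearProbe(nn.Module):
    """Frozen MedDINOv3 backbone + trainable linear head (illustrative sketch)."""

    def __init__(self, backbone, num_classes, feat_dim=768):  # 768 = ViT-B embedding dim
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the pretrained features
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            # Assumption: calling the backbone returns a (B, feat_dim) global feature
            # (DINOv2/DINOv3-style class-token output); check the repo's actual API.
            feats = self.backbone(x)
        return self.head(feats)
```

With the backbone loaded as in the getting-started section below, this could be used as `probe = FrozenLinearProbe(model, num_classes=2)`, training only the linear head.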
Out-of-Scope Use
- The model is trained only on CT images. Direct use for MRI, ultrasound, or natural images without adaptation is not recommended.
- Not validated for clinical decision-making without extensive downstream validation.
Bias, Risks, and Limitations
- Training data is limited to CT scans from 16 public datasets; the model may not generalize to underrepresented scanners, populations, or pathologies.
- The model was not designed to ensure fairness across demographic subgroups.
- Clinical deployment requires further validation to mitigate risks of false positives/negatives.
Recommendations
- Perform task-specific fine-tuning before clinical use.
- Validate on local datasets to assess generalization.
How to Get Started with the Model
Follow the setup instructions at https://github.com/ricklisz/MedDINOv3. After setting up the repository, you can load the pretrained backbone:
```python
import torch
from nnunetv2.training.nnUNetTrainer.dinov3.dinov3.models.vision_transformer import vit_base

# Initialize the ViT-B/16 backbone
model = vit_base(drop_path_rate=0.0, layerscale_init=1.0e-05, n_storage_tokens=4,
                 qkv_bias=False, mask_k_bias=True)

# Load the MedDINOv3-CT3M checkpoint (strict=False tolerates checkpoint keys the backbone does not use)
chkpt = torch.load("MedDINOv3-B-CT3M.pth", map_location="cpu")
model.load_state_dict(chkpt, strict=False)
```
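As a quick sanity check, the snippet below continues from the loading code above and runs a dummy slice through the backbone to inspect the dense features. The `forward_features` call, its dictionary keys, and the 3×224×224 input shape are assumptions based on the DINOv2/DINOv3-style API; adjust them to the repository's actual interface and preprocessing.

```python
model.eval()

# Dummy input standing in for a preprocessed CT slice
# (3-channel 224x224 input is an assumption; match the repo's preprocessing).
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Assumption: forward_features follows the DINOv2-style API and returns a dict
    # of normalized tokens; verify the key names against the repository.
    out = model.forward_features(x)

patch_tokens = out["x_norm_patchtokens"]  # dense features, e.g. (1, 196, 768) for ViT-B/16 at 224x224
cls_token = out["x_norm_clstoken"]        # global feature, shape (1, 768)
print(patch_tokens.shape, cls_token.shape)
```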
Training Details
Training Data
- Dataset: CT-3M (3,868,833 axial slices from 16 public CT datasets)
- Coverage: over 100 anatomical structures across abdominal, thoracic, and pelvic regions
Citation
```bibtex
@article{li2025meddinov3,
  title={MedDINOv3: How to Adapt Vision Foundation Models for Medical Image Segmentation?},
  author={Li, Yuheng and Wu, Yizhou and Lai, Yuxiang and Hu, Mingzhe and Yang, Xiaofeng},
  journal={arXiv preprint arXiv:2509.02379},
  year={2025},
  url={https://arxiv.org/abs/2509.02379}
}
```