Model Card for MedDINOv3
MedDINOv3 is a medical vision foundation model pretrained on CT-3M, a collection of 2D axial CT slices covering diverse anatomical regions. MedDINOv3 produces high-quality dense features that achieve strong performance on various CT segmentation tasks, significantly surpassing previous supervised CNN and transformer models.
Model Details
Model Description
We provide a ViT-B/16 model pretrained on CT-3M using the three-stage DINOv3 objective.
- Developed by: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
- Model type: Vision Transformer
Model Sources
- Repository: https://github.com/ricklisz/MedDINOv3
- Paper: arXiv:2509.02379
Uses
The model is a vision backbone providing multi-purpose features for downstream medical imaging tasks.
Direct Use
- Use as a frozen feature extractor for medical imaging tasks (e.g., segmentation, classification); a minimal sketch follows this list.
- Fine-tune within nnU-Net or other medical segmentation frameworks.
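As an illustration of the frozen-feature-extractor use case, here is a minimal linear-probe sketch for slice-level classification. The `FrozenLinearProbe` wrapper and the assumption that calling the backbone returns a 768-dimensional global feature (the ViT-B embedding size) are ours, not part of the repository; verify the backbone's forward interface and expected input preprocessing against the repo before use.

```python
import torch
import torch.nn as nn

class FrozenLinearProbe(nn.Module):
    """Frozen MedDINOv3 backbone + trainable linear head (illustrative sketch)."""

    def __init__(self, backbone, num_classes, feat_dim=768):  # 768 = ViT-B embedding dim
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the pretrained features
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            # Assumption: calling the backbone returns a (B, feat_dim) global feature
            # (DINOv2/DINOv3-style class-token output); check the repo's actual API.
            feats = self.backbone(x)
        return self.head(feats)
```

With the backbone loaded as in the getting-started section below, this could be used as `probe = FrozenLinearProbe(model, num_classes=2)`, training only the linear head.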
Out-of-Scope Use
- The model is trained only on CT images. Direct use for MRI, ultrasound, or natural images without adaptation is not recommended.
- Not validated for clinical decision-making without extensive downstream validation.
Bias, Risks, and Limitations
- Training data is limited to CT scans from 16 public datasets; the model may not generalize to underrepresented scanners, populations, or pathologies.
- The model was not designed to ensure fairness across demographic subgroups.
- Clinical deployment requires further validation to mitigate risks of false positives/negatives.
Recommendations
- Perform task-specific fine-tuning before clinical use.
- Validate on local datasets to assess generalization.
How to Get Started with the Model
Follow the setup instructions at https://github.com/ricklisz/MedDINOv3. After setting up the repository, you can load the pretrained backbone:
```python
import torch
from nnunetv2.training.nnUNetTrainer.dinov3.dinov3.models.vision_transformer import vit_base

# Initialize the ViT-B/16 backbone
model = vit_base(drop_path_rate=0.0, layerscale_init=1.0e-05, n_storage_tokens=4,
                 qkv_bias=False, mask_k_bias=True)

# Load the MedDINOv3-CT3M checkpoint (strict=False tolerates checkpoint keys the backbone does not use)
chkpt = torch.load("MedDINOv3-B-CT3M.pth", map_location="cpu")
model.load_state_dict(chkpt, strict=False)
```
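As a quick sanity check, the snippet below continues from the loading code above and runs a dummy slice through the backbone to inspect the dense features. The `forward_features` call, its dictionary keys, and the 3×224×224 input shape are assumptions based on the DINOv2/DINOv3-style API; adjust them to the repository's actual interface and preprocessing.

```python
model.eval()

# Dummy input standing in for a preprocessed CT slice
# (3-channel 224x224 input is an assumption; match the repo's preprocessing).
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Assumption: forward_features follows the DINOv2-style API and returns a dict
    # of normalized tokens; verify the key names against the repository.
    out = model.forward_features(x)

patch_tokens = out["x_norm_patchtokens"]  # dense features, e.g. (1, 196, 768) for ViT-B/16 at 224x224
cls_token = out["x_norm_clstoken"]        # global feature, shape (1, 768)
print(patch_tokens.shape, cls_token.shape)
```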
Training Details
Training Data
- Dataset: CT-3M (3,868,833 axial slices from 16 public CT datasets)
- Coverage: over 100 anatomical structures across abdominal, thoracic, and pelvic regions
Citation
```bibtex
@article{li2025meddinov3,
  title={MedDINOv3: How to Adapt Vision Foundation Models for Medical Image Segmentation?},
  author={Li, Yuheng and Wu, Yizhou and Lai, Yuxiang and Hu, Mingzhe and Yang, Xiaofeng},
  journal={arXiv preprint arXiv:2509.02379},
  year={2025},
  url={https://arxiv.org/abs/2509.02379}
}
```