Model Card for MedDINOv3

MedDINOv3 is a medical vision foundation model pretrained on CT-3M, a collection of 2D axial CT slices covering diverse anatomical regions. MedDINOv3 produces high-quality dense features that achieve strong performance on various CT segmentation tasks, significantly surpassing previous supervised CNN and transformer models.

Model Details

Model Description

We provide a ViT-B/16 backbone pretrained on CT-3M using the three-stage DINOv3 objective.

  • Developed by: Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
  • Model type: Vision Transformer

Model Sources

  • Repository: https://github.com/ricklisz/MedDINOv3
  • Paper: https://arxiv.org/abs/2509.02379

Uses

The model is a vision backbone providing multi-purpose features for downstream medical imaging tasks.

Direct Use

  • Use as a frozen feature extractor for medical imaging tasks (e.g., segmentation, classification).
  • Fine-tuning within nnU-Net or other medical segmentation frameworks.

Out-of-Scope Use

  • The model is trained only on CT images. Direct use for MRI, ultrasound, or natural images without adaptation is not recommended.
  • Not intended for clinical decision-making without extensive downstream validation.

Bias, Risks, and Limitations

  • Training data is limited to CT scans from 16 public datasets; the model may not generalize to underrepresented scanners, populations, or pathologies.
  • The model was not designed to ensure fairness across demographic subgroups.
  • Clinical deployment requires further validation to mitigate risks of false positives/negatives.

Recommendations

  • Perform task-specific fine-tuning before clinical use.
  • Validate on local datasets to assess generalization.

How to Get Started with the Model

Please follow the setup instructions at https://github.com/ricklisz/MedDINOv3.

After setting up the repository, you can load the pretrained backbone as follows:

import torch
from nnunetv2.training.nnUNetTrainer.dinov3.dinov3.models.vision_transformer import vit_base

# Initialize the ViT-B/16 backbone with the MedDINOv3 configuration
model = vit_base(
    drop_path_rate=0.0,
    layerscale_init=1.0e-05,
    n_storage_tokens=4,
    qkv_bias=False,
    mask_k_bias=True,
)

# Load the MedDINOv3-CT3M checkpoint (strict=False tolerates missing/unexpected keys)
checkpoint = torch.load("MedDINOv3-B-CT3M.pth", map_location="cpu")
model.load_state_dict(checkpoint, strict=False)
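
As a usage sketch, the loaded backbone can then serve as a frozen feature extractor for a CT slice. This assumes a DINOv2-style get_intermediate_layers helper and a 3-channel 224x224 input (grayscale slices replicated across channels); check the repository for the exact feature-extraction API of the DINOv3 ViT.

# Continuing from the snippet above: freeze the backbone and extract dense patch features.
model.eval()
for p in model.parameters():
    p.requires_grad = False

# Dummy CT slice; in practice, replicate the grayscale slice across 3 channels.
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Assumption: get_intermediate_layers() follows the DINOv2-style signature;
    # adapt this call if the repository exposes a different feature helper.
    feats = model.get_intermediate_layers(x, n=1, reshape=True)[0]

print(feats.shape)  # roughly (1, 768, 14, 14) for ViT-B/16 at 224x224 resolution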

Training Details

Training Data

  • Dataset: CT-3M (3,868,833 axial slices from 16 public CT datasets)
  • Coverage: over 100 anatomical structures across abdominal, thoracic, and pelvic regions

Citation

@article{li2025meddinov3,
  title={MedDINOv3: How to Adapt Vision Foundation Models for Medical Image Segmentation?},
  author={Li, Yuheng and Wu, Yizhou and Lai, Yuxiang and Hu, Mingzhe and Yang, Xiaofeng},
  journal={arXiv preprint arXiv:2509.02379},
  year={2025},
  url={https://arxiv.org/abs/2509.02379}
}