---
license: apache-2.0
pipeline_tag: image-segmentation
tags:
- medical
- vision-transformer
- dinov3
- CT
---

# Model Card for MedDINOv3

MedDINOv3 is a medical vision foundation model pretrained on CT-3M, a collection of 2D axial CT slices covering diverse anatomical regions. MedDINOv3 produces high-quality dense features that achieve strong performance on various CT segmentation tasks, significantly surpassing previous supervised CNN and transformer models.

## Model Details

### Model Description

We provide a ViT-B/16 backbone pretrained on CT-3M using the three-stage DINOv3 objective.

- **Developed by:** Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
- **Model type:** Vision Transformer

### Model Sources

- **Repository:** [GitHub – MedDINOv3](https://github.com/ricklisz/MedDINOv3)
- **Paper:** [arXiv:2509.02379](https://arxiv.org/abs/2509.02379)

## Uses

The model is a vision backbone that provides general-purpose features for downstream medical imaging tasks.

### Direct Use

- Use as a **frozen feature extractor** for medical imaging tasks (e.g., segmentation, classification); a minimal sketch is included at the end of this card.
- Fine-tuning within **nnU-Net** or other medical segmentation frameworks.

### Out-of-Scope Use

- The model is trained only on **CT images**. Direct use on MRI, ultrasound, or natural images without adaptation is not recommended.
- Not validated for **clinical decision-making** without extensive downstream validation.

## Bias, Risks, and Limitations

- Training data is limited to CT scans from 16 public datasets; the model may not generalize to underrepresented scanners, populations, or pathologies.
- The model was not designed to ensure fairness across demographic subgroups.
- Clinical deployment requires further validation to mitigate the risk of false positives/negatives.

### Recommendations

- Perform **task-specific fine-tuning** before clinical use.
- Validate on **local datasets** to assess generalization.

## How to Get Started with the Model

Follow the setup instructions at https://github.com/ricklisz/MedDINOv3. Once the repository is set up, you can load the pretrained backbone:

```python
import torch
from nnunetv2.training.nnUNetTrainer.dinov3.dinov3.models.vision_transformer import vit_base

# Initialize the ViT-B/16 backbone
model = vit_base(
    drop_path_rate=0.0,
    layerscale_init=1.0e-05,
    n_storage_tokens=4,
    qkv_bias=False,
    mask_k_bias=True,
)

# Load the MedDINOv3-CT3M checkpoint
chkpt = torch.load("MedDINOv3-B-CT3M.pth", map_location="cpu")
model.load_state_dict(chkpt, strict=False)
```

## Training Details

### Training Data

- **Dataset:** CT-3M (3,868,833 axial slices from 16 public CT datasets)
- **Coverage:** Over 100 anatomical structures across abdominal, thoracic, and pelvic regions

## Citation

```
@article{li2025meddinov3,
  title={MedDINOv3: How to Adapt Vision Foundation Models for Medical Image Segmentation?},
  author={Li, Yuheng and Wu, Yizhou and Lai, Yuxiang and Hu, Mingzhe and Yang, Xiaofeng},
  journal={arXiv preprint arXiv:2509.02379},
  year={2025},
  url={https://arxiv.org/abs/2509.02379}
}
```
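
## Example: Frozen Feature Extraction

The sketch below illustrates the "frozen feature extractor" use case mentioned under Direct Use. It is a minimal example, not the authors' pipeline: the input size (224×224), the replication of the single CT channel to three channels, and the `get_intermediate_layers` call (mirroring the upstream DINOv2/DINOv3 ViT interface) are all assumptions; real inputs should be windowed and normalized the same way as during pretraining.

```python
import torch

# Assumes `model` is the MedDINOv3 backbone loaded as shown above.
model.eval()

# Dummy batch of 2D axial CT slices: 1 sample, 3 channels (single CT channel
# replicated), 224x224 pixels. Input size and preprocessing are assumptions here.
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Dense patch features from the last block, reshaped to a 2D feature map
    # of shape (B, C, H/16, W/16). This call mirrors the DINOv2/DINOv3 API and
    # may differ in the vendored MedDINOv3 code.
    feats = model.get_intermediate_layers(x, n=1, reshape=True)[0]

print(feats.shape)  # e.g. torch.Size([1, 768, 14, 14]) for ViT-B/16 at 224x224
```

These dense features can then be fed to a lightweight segmentation or classification head while the backbone stays frozen.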