---
license: apache-2.0
pipeline_tag: image-segmentation
tags:
- medical
- vision-transformer
- dinov3
- CT
---

# Model Card for MedDINOv3

MedDINOv3 is a medical vision foundation model pretrained on CT-3M, a collection of 2D axial CT slices spanning diverse anatomical regions. It produces high-quality dense features that achieve strong performance on a range of CT segmentation tasks, significantly surpassing previous supervised CNN and transformer models.

## Model Details

### Model Description

We provide a ViT-B/16 backbone pretrained on CT-3M with the three-stage DINOv3 objective.
- **Developed by:** Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang  
- **Model type:** Vision Transformer

### Model Sources

- **Repository:** [GitHub – MedDINOv3](https://github.com/ricklisz/MedDINOv3)  
- **Paper:** [arXiv:2509.02379](https://arxiv.org/abs/2509.02379)  


## Uses

The model is a vision backbone providing multi-purpose features for downstream medical imaging tasks.

### Direct Use
- Use as a **frozen feature extractor** for medical imaging tasks (e.g., segmentation, classification); see the sketch after this list.  
- Fine-tuning within **nnU-Net** or other medical segmentation frameworks.  
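
As a concrete illustration of the frozen-extractor pattern, here is a minimal, generic PyTorch sketch (not code from the MedDINOv3 repo; the feature dimension 768 matches ViT-B, and the linear head and class count are placeholders):

```python
import torch.nn as nn

def freeze(backbone: nn.Module) -> nn.Module:
    """Freeze every backbone parameter so only a downstream head trains."""
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone.eval()

# Illustrative linear probe on pooled ViT-B features (embedding dim 768);
# the number of output classes is a placeholder.
head = nn.Linear(768, 4)
```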

### Out-of-Scope Use
- The model is trained only on **CT images**. Direct use for MRI, ultrasound, or natural images without adaptation is not recommended.  
- Not validated for **clinical decision-making** without extensive downstream validation.  


## Bias, Risks, and Limitations
- Training data is limited to CT scans from public datasets (16 sources). It may not generalize to underrepresented scanners, populations, or pathologies.  
- The model was not designed to ensure fairness across demographic subgroups.  
- Clinical deployment requires further validation to mitigate risks of false positives/negatives.  

### Recommendations
- Perform **task-specific fine-tuning** before clinical use.  
- Validate on **local datasets** to assess generalization.  

## How to Get Started with the Model

Please follow the setup instructions in the [MedDINOv3 repository](https://github.com/ricklisz/MedDINOv3).

After setting up the repo, you can build the backbone and load the pretrained weights:

```python
import torch
from nnunetv2.training.nnUNetTrainer.dinov3.dinov3.models.vision_transformer import vit_base

# Initialize the ViT-B/16 backbone with the MedDINOv3 configuration
model = vit_base(
    drop_path_rate=0.0,
    layerscale_init=1.0e-05,
    n_storage_tokens=4,
    qkv_bias=False,
    mask_k_bias=True,
)

# Load the MedDINOv3-CT3M checkpoint; strict=False tolerates checkpoint
# keys (e.g., pretraining heads) that are absent from the backbone
chkpt = torch.load("MedDINOv3-B-CT3M.pth", map_location="cpu")
model.load_state_dict(chkpt, strict=False)
```
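
For reference, a hedged sketch of pushing a dummy slice through the loaded backbone. It assumes the DINOv3-style ViT exposes `get_intermediate_layers()` (as in Meta's DINOv2/DINOv3 code) and uses placeholder preprocessing rather than the repo's actual CT windowing, so verify both against the repo:

```python
model.eval()

# Placeholder input: one axial slice replicated to 3 channels at 224x224;
# the repo's actual intensity windowing/normalization may differ.
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    # Patch tokens from the last block, reshaped to a (B, C, H/16, W/16)
    # feature map (DINOv2/DINOv3-style API).
    feats = model.get_intermediate_layers(x, n=1, reshape=True)[0]

print(feats.shape)  # e.g., torch.Size([1, 768, 14, 14]) for ViT-B/16
```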

## Training Details

### Training Data

- **Dataset:** CT-3M (3,868,833 axial slices from 16 public CT datasets)  
- **Coverage:** over 100 anatomical structures across abdominal, thoracic, and pelvic regions  


## Citation
```
@article{li2025meddinov3,
  title={MedDINOv3: How to Adapt Vision Foundation Models for Medical Image Segmentation?},
  author={Li, Yuheng and Wu, Yizhou and Lai, Yuxiang and Hu, Mingzhe and Yang, Xiaofeng},
  journal={arXiv preprint arXiv:2509.02379},
  year={2025},
  url={https://arxiv.org/abs/2509.02379}
}
```