---
license: mit
tags:
- chest-xray
- medical
- multimodal
- retrieval
- explanation
- clinicalbert
- swin-transformer
- deep-learning
- image-text
datasets:
- openi
language:
- en
---
# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)
This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:
- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention and Integrated Gradients (IG)
Developed as a final project at HCMUS.
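For orientation, the snippet below shows how the two backbones named in this card can be loaded to produce per-modality embeddings. This is a minimal sketch, not the released inference code: the checkpoint names are the public backbones (`swin_base_patch4_window7_224` from timm and `emilyalsentzer/Bio_ClinicalBERT`), and the pooling choices are assumptions, not this model's fine-tuned configuration.

```python
import torch
import timm
from transformers import AutoModel, AutoTokenizer

# Public backbones referenced by this card (not the fine-tuned weights).
image_encoder = timm.create_model(
    "swin_base_patch4_window7_224", pretrained=True, num_classes=0
)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

@torch.no_grad()
def encode(image, report):
    # image: (1, 3, 224, 224) tensor; report: str from the XML report
    img_emb = image_encoder(image)  # pooled Swin features, (1, 1024)
    tokens = tokenizer(report, return_tensors="pt", truncation=True)
    txt_emb = text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS], (1, 768)
    return img_emb, txt_emb
```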
## Model Architecture
- Image Encoder: Swin Transformer (pretrained, fine-tuned)
- Text Encoder: ClinicalBERT
- Fusion Module: Cross-modal attention with optional hybrid FFN layers
- Losses: BCE + Focal Loss for multi-label classification
Embeddings from both modalities are projected into a shared joint space, enabling retrieval and explanation.
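The sketch below shows what such a fusion head could look like. All dimensions (1024-d Swin features, 768-d ClinicalBERT features, a 512-d joint space), the label count, and the focal-loss hyperparameters are illustrative assumptions, not the trained model's actual configuration; the focal term modulating BCE is one common way to combine the two losses named above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: project both modalities into a shared space, let text
    tokens attend to image tokens, then classify (multi-label)."""
    def __init__(self, img_dim=1024, txt_dim=768, joint_dim=512, n_labels=14):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)
        self.cross_attn = nn.MultiheadAttention(joint_dim, num_heads=8,
                                                batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(joint_dim, joint_dim * 2), nn.GELU(),
                                 nn.Linear(joint_dim * 2, joint_dim))
        self.classifier = nn.Linear(joint_dim, n_labels)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, img_dim); txt_tokens: (B, N_txt, txt_dim)
        img = self.img_proj(img_tokens)
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.cross_attn(query=txt, key=img, value=img)
        fused = fused + self.ffn(fused)   # residual FFN refinement
        joint = fused.mean(dim=1)         # shared joint embedding
        return self.classifier(joint), joint

def bce_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # BCE reweighted by the focal term (1 - p_t)^gamma to emphasize
    # hard examples; targets is a float multi-hot tensor.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true label
    return (alpha * (1 - p_t) ** gamma * bce).mean()
```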
## Training Data
- Dataset: NIH Open-i Chest X-ray Dataset
- Input Modalities:
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- Labels: MeSH-derived disease categories (multi-label); see the loading sketch below
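A loading sketch for one case, under the assumption that reports follow Open-i's XML schema (`AbstractText` sections with `Label` attributes, plus MeSH `<major>` tags); verify the element names against your copy of the dataset.

```python
import xml.etree.ElementTree as ET
import numpy as np
import pydicom

def load_case(dicom_path, report_path):
    # Read and min-max normalize the X-ray pixel data.
    dcm = pydicom.dcmread(dicom_path)
    image = dcm.pixel_array.astype(np.float32)
    image = (image - image.min()) / (image.max() - image.min() + 1e-8)

    # Pull the narrative sections and the MeSH major terms from the report.
    root = ET.parse(report_path).getroot()
    sections = {el.get("Label"): (el.text or "")
                for el in root.iter("AbstractText")}
    report = " ".join(sections.get(k, "") for k in ("FINDINGS", "IMPRESSION"))
    mesh_terms = [el.text for el in root.iter("major")]  # multi-label targets
    return image, report, mesh_terms
```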
## Intended Uses
- Clinical Education: case similarity search for radiology students (see the retrieval sketch below)
- Research: a baseline for multimodal medical retrieval
- Explainability: visualizing disease evidence in both image and text (see the IG sketch below)
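Case retrieval reduces to nearest-neighbor search in the shared joint space. A minimal sketch, assuming joint embeddings have already been precomputed for the index set:

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, index_embs, k=5):
    """Rank stored cases by cosine similarity in the joint space.
    query_emb: (D,) embedding of the query case.
    index_embs: (N, D) precomputed joint embeddings of the case library."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), index_embs)  # (N,)
    scores, idx = sims.topk(min(k, sims.numel()))
    return idx.tolist(), scores.tolist()
```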
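For image-side explanations, Captum's `IntegratedGradients` can attribute a predicted label back to input pixels. The `model_image_to_logits` argument below is hypothetical: it stands for any callable that maps a batch of images to this model's multi-label logits with the report input held fixed.

```python
import torch
from captum.attr import IntegratedGradients

def explain_image(model_image_to_logits, image, target_label):
    # Integrate gradients along the path from an all-black baseline
    # image to the actual input, for the chosen disease label.
    ig = IntegratedGradients(model_image_to_logits)
    baseline = torch.zeros_like(image)
    attributions = ig.attribute(image, baselines=baseline, target=target_label)
    return attributions  # same shape as `image`; visualize as a heatmap
```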
## Limitations & Risks
- Trained on a single public dataset (Open-i), so performance may not generalize to other hospitals, scanners, or patient populations
- Attention and IG explanations have not been clinically validated
- Not intended for diagnostic use in real-world clinical settings
## Acknowledgments
- NIH Open-i Dataset
- Swin Transformer (timm)
- ClinicalBERT (Emily Alsentzer)
- Captum (for IG explanations)