---
license: mit
tags:
  - chest-xray
  - medical
  - multimodal
  - retrieval
  - explanation
  - clinicalbert
  - swin-transformer
  - deep-learning
  - image-text
datasets:
  - openi
language:
  - en
---

Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

  • Predict medical conditions from multimodal input (image + text)
  • Retrieve similar cases using shared disease-aware embeddings
  • Provide visual explanations using attention and Integrated Gradients (IG)

Developed as a final project at HCMUS.


Model Architecture

  • Image Encoder: Swin Transformer (pretrained, fine-tuned)
  • Text Encoder: ClinicalBERT
  • Fusion Module: Cross-modal attention with optional hybrid FFN layers
  • Losses: BCE + Focal Loss for multi-label classification

Embeddings from both modalities are projected into a shared joint space, enabling retrieval and explanation.
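
A minimal sketch of the fusion step, in PyTorch, assuming illustrative embedding sizes and module names (this is not the exact training code):

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Cross-attend report tokens over image patches, then project into the joint space."""

    def __init__(self, img_dim=1024, txt_dim=768, joint_dim=256, n_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, txt_dim)        # align Swin features to ClinicalBERT width
        self.cross_attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                          # optional hybrid FFN layer
            nn.Linear(txt_dim, 2 * txt_dim), nn.GELU(), nn.Linear(2 * txt_dim, txt_dim)
        )
        self.joint_proj = nn.Linear(txt_dim, joint_dim)    # shared disease-aware embedding

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_patches, img_dim) from Swin; txt_tokens: (B, N_tokens, txt_dim) from ClinicalBERT
        img = self.img_proj(img_tokens)
        fused, _ = self.cross_attn(query=txt_tokens, key=img, value=img)
        fused = fused + self.ffn(fused)
        return self.joint_proj(fused.mean(dim=1))          # pooled joint embedding for retrieval / classification
```

Retrieval then reduces to cosine similarity between these joint embeddings. The BCE + Focal Loss objective for multi-label classification can be sketched as follows (the gamma and alpha values are assumptions):

```python
import torch
import torch.nn.functional as F


def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss on top of per-label BCE-with-logits for multi-label targets."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                              # probability assigned to the true label value
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()
```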


Training Data

  • Dataset: NIH Open-i Chest X-ray Dataset
  • Input Modalities:
    • Chest X-ray DICOMs
    • Associated XML radiology reports
  • Labels: MeSH-derived disease categories (multi-label; see the sketch below)
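
As an illustration of the label format, here is a minimal sketch of turning a report's MeSH terms into a multi-hot target vector; the XML tag name ("major") and the category list are assumptions, not the exact preprocessing used for this model:

```python
import xml.etree.ElementTree as ET
import torch

CATEGORIES = ["Cardiomegaly", "Pulmonary Edema", "Pleural Effusion", "Pneumonia", "Normal"]  # illustrative subset


def mesh_labels(xml_path: str) -> torch.Tensor:
    """Map a report's MeSH major headings onto a multi-hot vector over CATEGORIES."""
    root = ET.parse(xml_path).getroot()
    terms = {m.text.split("/")[0].strip() for m in root.iter("major") if m.text}
    return torch.tensor([1.0 if c in terms else 0.0 for c in CATEGORIES])
```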

Intended Uses

  • Clinical Education: Case similarity search for radiology students

  • Research: Baseline for multimodal medical retrieval

  • Explainability: Visualize disease evidence in both image and text (attribution sketch below)
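
A minimal Captum Integrated Gradients sketch for the image branch; the tiny CNN below stands in for the real encoder and classifier head so the snippet is self-contained:

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in for the image encoder + classifier head (5 disease logits)
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 5),
).eval()

image = torch.rand(1, 1, 224, 224, requires_grad=True)    # dummy chest X-ray tensor
ig = IntegratedGradients(model)
attributions = ig.attribute(image, target=2, n_steps=32)  # pixel-level evidence for label index 2
```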

Limitations & Risks

  • Trained on a single public dataset (Open-i); may not generalize to data from other hospitals

  • Explanations are not clinically validated

  • Not for diagnostic use in real-world settings

Acknowledgments

  • NIH Open-i Dataset

  • Swin Transformer (timm)

  • ClinicalBERT (Emily Alsentzer)

  • Captum (for IG explanations)

Code link: GitHub