Improve model card with tags and paper link

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +39 -7
README.md CHANGED
@@ -1,14 +1,46 @@
  ---
  license: mit
  ---

- This is the checkpoint for Stable Diffusion fine-tuned with CXR-BERT on the MIMIC-CXR dataset.
- This checkpoint can be used to reproduce the results of the Generate to Ground paper.
- The corresponding code can be found here: https://github.com/Felix-012/generate_to_ground/.
 
- Training was performed for 30,000 steps using eight A100 GPUs.

- Stable Diffusion: https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5 \
- CXR-BERT: https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-specialized \
- MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/

  ---
  license: mit
+ pipeline_tag: zero-shot-object-detection
+ library_name: diffusers
  ---

+ # Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
+
+ This is a checkpoint for Stable Diffusion fine-tuned with CXR-BERT on the MIMIC-CXR dataset, as presented in the paper [Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models](https://huggingface.co/papers/2507.12236).
+
+ This model introduces a novel approach to **phrase grounding** in medical imaging, demonstrating that generative text-to-image diffusion models, specifically fine-tuned Stable Diffusion, can achieve superior zero-shot performance compared to traditional discriminative methods. Key innovations include:
+ - Leveraging cross-attention maps from generative diffusion models for phrase grounding.
+ - Fine-tuning diffusion models with a frozen, domain-specific language model (CXR-BERT) to significantly improve performance in medical contexts.
+ - Introducing **Bimodal Bias Merging (BBM)**, a novel post-processing technique that aligns text and image biases to refine cross-attention maps and enhance localization accuracy.
+
+ The model maps natural-language phrases from clinical reports to the image regions they describe, enabling disease localization. Training was performed for 30,000 steps on eight A100 GPUs.
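+
+ As a rough illustration of the cross-attention idea above, the sketch below records cross-attention probabilities from a `diffusers` UNet; averaged over layers, heads, and denoising steps, the columns belonging to a phrase's tokens form coarse localization heatmaps. This is a minimal sketch, not the authors' implementation: the base checkpoint, the example phrase, and the aggregation step are illustrative assumptions, and the paper's full method (CXR-BERT conditioning and BBM) lives in the repository linked below.
+
+ ```python
+ # Sketch: record cross-attention maps from every cross-attention layer.
+ # Illustrative only; see the official repository for the actual grounding code.
+ import torch
+ from diffusers import StableDiffusionPipeline
+ from diffusers.models.attention_processor import AttnProcessor
+
+ class StoreCrossAttn(AttnProcessor):
+     """Records cross-attention probabilities (pixels x text tokens)."""
+     def __init__(self, store):
+         super().__init__()
+         self.store = store
+
+     def __call__(self, attn, hidden_states, encoder_hidden_states=None,
+                  attention_mask=None, **kwargs):
+         if encoder_hidden_states is not None:  # cross-attention only
+             q = attn.head_to_batch_dim(attn.to_q(hidden_states))
+             k = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
+             self.store.append(  # (batch*heads, pixels, text tokens)
+                 attn.get_attention_scores(q, k).detach().cpu()
+             )
+         return super().__call__(attn, hidden_states,
+                                 encoder_hidden_states, attention_mask)
+
+ pipe = StableDiffusionPipeline.from_pretrained(
+     "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
+ ).to("cuda")
+ maps = []
+ pipe.unet.set_attn_processor(StoreCrossAttn(maps))
+ pipe("right basal pleural effusion", num_inference_steps=25)
+ # Average `maps` over steps/layers/heads, reshape the pixel axis to 2D,
+ # and read the columns of the phrase's tokens as localization heatmaps.
+ ```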
+
+ ## Base Models and Datasets
+ * **Stable Diffusion:** [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)
+ * **CXR-BERT:** [microsoft/BiomedVLP-CXR-BERT-specialized](https://huggingface.co/microsoft/BiomedVLP-CXR-BERT-specialized)
+ * **MIMIC-CXR (dataset):** [physionet.org/content/mimic-cxr/2.0.0/](https://physionet.org/content/mimic-cxr/2.0.0/)
+
+ ## Usage and Reproduction
+
+ To reproduce the results of the "Generate to Ground" paper, including environment setup, data preparation, and the evaluation scripts (with and without Bimodal Bias Merging), please refer to the official GitHub repository, which provides comprehensive instructions and the corresponding code:
+
+ [https://github.com/Felix-012/generate_to_ground/](https://github.com/Felix-012/generate_to_ground/)
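+
+ For quick experimentation, the individual components can typically be loaded as sketched below. This is a hedged example rather than documented usage: the checkpoint is assumed to follow the standard diffusers subfolder layout, and `<this-repo-id>` is a placeholder for this model's Hub id.
+
+ ```python
+ # Hedged loading sketch; the official repository wires these pieces together.
+ from transformers import AutoModel, AutoTokenizer
+ from diffusers import UNet2DConditionModel
+
+ # Frozen, domain-specific text encoder used for conditioning.
+ tokenizer = AutoTokenizer.from_pretrained(
+     "microsoft/BiomedVLP-CXR-BERT-specialized", trust_remote_code=True
+ )
+ text_encoder = AutoModel.from_pretrained(
+     "microsoft/BiomedVLP-CXR-BERT-specialized", trust_remote_code=True
+ )
+
+ # Fine-tuned UNet from this checkpoint (assumed diffusers layout).
+ unet = UNet2DConditionModel.from_pretrained("<this-repo-id>", subfolder="unet")
+ ```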
+
+ ## Citation
+
+ If you find this work helpful or inspiring, please consider citing the original paper:
+
+ ```bibtex
+ @inproceedings{
+ nutzel2025generate,
+ title={Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models},
+ author={Felix N{\"u}tzel and Mischa Dombrowski and Bernhard Kainz},
+ booktitle={Medical Imaging with Deep Learning},
+ year={2025},
+ url={https://openreview.net/forum?id=yTjotBI30L}
+ }
+ ```
+
+ ## Acknowledgement
+
+ (Some) HPC resources were provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) under the NHR projects b143dc and b180dc. NHR funding is provided by federal and Bavarian state authorities. NHR@FAU hardware is partially funded by the German Research Foundation (DFG) – 440719683.