RadEval: A framework for radiology text evaluation
Abstract
We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
Community
🚀 RadEval will be presented as an oral at EMNLP 2025
RadEval integrates 11+ state-of-the-art metrics, ranging from lexical and semantic to clinical and temporal, into a single easy-to-use framework.
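For orientation, here is a minimal sketch of what driving such a unified evaluator from Python could look like. The `RadEval` class name, its constructor flags, and the call signature are assumptions inferred from the description above rather than a confirmed API; the GitHub README is the authoritative reference.

```python
# Hypothetical usage sketch: class name, flags, and call signature are
# assumptions, not the confirmed RadEval API (see the GitHub repo).
from RadEval import RadEval

refs = [
    "No acute cardiopulmonary abnormality.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
]
hyps = [
    "No acute cardiopulmonary process.",
    "Enlarged heart with trace bilateral pleural effusions.",
]

# Enable only the metric families you need (flag names assumed).
evaluator = RadEval(do_bleu=True, do_rouge=True, do_bertscore=True, do_radgraph=True)

# Assumed to return a dict mapping metric names to corpus-level scores.
scores = evaluator(refs=refs, hyps=hyps)
print(scores)
```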
Beyond existing benchmarks, RadEval introduces 🤗 RadEvalBERTScore, a new domain-adapted metric that outperforms all prior text-based approaches for medical text evaluation (a loading sketch for the underlying encoder follows the resource links below).
The toolkit is paired with the RadEval Expert Dataset, a radiologist-annotated benchmark that distinguishes clinically significant from insignificant errors across multiple categories. The dataset includes 208 studies (148 findings and 60 impressions), with exactly 3 annotated candidate reports per ground truth. Ground-truth reports are sourced from MIMIC-CXR, CheXpert-Plus, and ReXGradient-160K, while candidate reports are generated by CheXagent, the CheXpert-Plus model, and MAIRA-2. This benchmark enables rigorous assessment of how automatic metrics align with expert radiologists' judgments.
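To browse the expert annotations, the dataset can presumably be pulled straight from the Hugging Face Hub with the `datasets` library; the split and column layout noted in the comments are illustrative assumptions, since only the study counts and the three-candidates-per-ground-truth structure are stated above.

```python
# Sketch: load the RadEval Expert Dataset from the Hub and inspect it.
# Split names and column layout are assumptions for illustration only.
from datasets import load_dataset

expert = load_dataset("IAMJB/RadEvalExpertDataset")
print(expert)  # shows the available splits and their columns

first_split = next(iter(expert.values()))
# Expected (assumed) layout: a ground-truth report, three candidate reports,
# and error annotations flagged as clinically significant or insignificant.
print(first_split[0])
```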
RadEval further supports statistical significance testing for system comparisons, detailed breakdowns per metric, and efficient batch processing for large-scale research.
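Below is a sketch of what a system-level comparison with significance testing could look like. The `compare_systems` helper and its arguments are hypothetical placeholders for whatever the toolkit actually exposes; a paired bootstrap or permutation test over per-report scores is the standard approach for this kind of comparison.

```python
# Hypothetical significance-testing sketch; compare_systems and its
# arguments are assumed names, not the confirmed RadEval API.
from RadEval import RadEval, compare_systems

refs = [
    "No focal consolidation, pleural effusion, or pneumothorax.",
    "Stable mild cardiomegaly. Lungs are clear.",
]
systems = {
    "model_a": ["No consolidation, effusion, or pneumothorax.",
                "Mild cardiomegaly, clear lungs."],
    "model_b": ["Clear lungs.",
                "Normal study."],
}

evaluator = RadEval(do_bleu=True, do_bertscore=True)

# A paired permutation/bootstrap test over per-report scores would report,
# for each metric, whether the difference between systems is significant.
results = compare_systems(evaluator, refs=refs, systems=systems)
print(results)
```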
🔗 Resources:
📦 GitHub: https://github.com/jbdel/RadEval
🤗 Model: https://huggingface.co/IAMJB/RadEvalModernBERT
🤗 Expert annotated dataset: https://huggingface.co/datasets/IAMJB/RadEvalExpertDataset
🎮 Online Demo: https://huggingface.co/spaces/X-iZhang/RadEval
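As a small illustration of the released encoder, the sketch below loads IAMJB/RadEvalModernBERT with transformers and scores one candidate report against a reference via mean-pooled cosine similarity. This assumes the checkpoint loads through the standard AutoModel/AutoTokenizer interface and is only a rough stand-in, not the official RadEvalBERTScore computation.

```python
# Sketch: embed two reports with the released encoder and compare them.
# Assumes AutoModel/AutoTokenizer compatibility; this mean-pooled cosine
# similarity is illustrative, not the official RadEvalBERTScore metric.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IAMJB/RadEvalModernBERT")
model = AutoModel.from_pretrained("IAMJB/RadEvalModernBERT")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding for a single report."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

ref = "Mild cardiomegaly. No pleural effusion or pneumothorax."
hyp = "Heart size is mildly enlarged; no effusion or pneumothorax."
score = torch.nn.functional.cosine_similarity(embed(ref), embed(hyp), dim=0)
print(f"similarity: {score.item():.3f}")
```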
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores (2025)
- HARE: an entity and relation centric evaluation framework for histopathology reports (2025)
- Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays (2025)
- AMRG: Extend Vision Language Models for Automatic Mammography Report Generation (2025)
- Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation (2025)
- PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation (2025)
- MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification (2025)