
UTAustin-AIHealth
Welcome to UTAustin-AIHealth – a hub dedicated to advancing research in medical AI. This repo contains the MedHallu dataset, which underpins our recent work:
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
MedHallu is a rigorously designed benchmark for evaluating how well large language models detect hallucinations in medical question answering. The dataset is organized into two subsets, each loaded as a separate configuration:
- pqa_labeled: Contains 1,000 high-quality, human-annotated samples derived from PubMedQA.
- pqa_artificial: Contains 9,000 samples generated via an automated pipeline from PubMedQA.
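After installing the Hugging Face datasets library, you can check that a local copy matches the subset sizes listed above. The sketch below is an illustration, not part of the official MedHallu tooling. EXPECTED_ROWS, total_rows, and check_config_sizes are hypothetical helper names. It relies only on load_dataset(path, name) returning a mapping of split name to dataset, and it sums rows across all splits so it makes no assumption about how each configuration names its splits.

```python
# Row counts documented for each MedHallu subset (configuration name -> rows).
EXPECTED_ROWS = {"pqa_labeled": 1_000, "pqa_artificial": 9_000}


def total_rows(dataset_dict):
    """Sum the row counts of every split in a DatasetDict (or any mapping of split name -> sized collection)."""
    return sum(len(split) for split in dataset_dict.values())


def check_config_sizes(loader=None, repo="UTAustin-AIHealth/MedHallu"):
    """Return {config: (expected, actual)} for each subset whose size differs from the documented count."""
    if loader is None:
        # Imported lazily so the helpers above work even without the library installed.
        from datasets import load_dataset
        loader = load_dataset
    mismatches = {}
    for config, expected in EXPECTED_ROWS.items():
        # loader(path, name) mirrors datasets.load_dataset's first two positional parameters.
        actual = total_rows(loader(repo, config))
        if actual != expected:
            mismatches[config] = (expected, actual)
    return mismatches
```

Calling check_config_sizes() with no arguments downloads both subsets on first use. An empty dict means the row counts agree with the documented sizes. The loader parameter exists only so the function can be exercised offline with a stand-in.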
Setup Environment
To work with the MedHallu dataset, install the Hugging Face datasets library with pip:
pip install datasets
How to Use MedHallu
Downloading the Dataset:
from datasets import load_dataset
# Load the 'pqa_labeled' split: 1,000 high-quality, human-annotated samples.
medhallu_labeled = load_dataset("UTAustin-AIHealth/MedHallu", "pqa_labeled")
# Load the 'pqa_artificial' split: 9,000 samples generated via an automated pipeline.
medhallu_artificial = load_dataset("UTAustin-AIHealth/MedHallu", "pqa_artificial")
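Once loaded, each record can be framed as a binary detection task: given a question and a candidate answer, decide whether the answer is hallucinated. The sketch below is one possible way to do this and score a detector. It is not necessarily the benchmark's official evaluation protocol. The column names ("Question", "Ground Truth", "Hallucinated Answer") are assumptions, so check them against the loaded dataset's column_names before use. DetectionExample, to_detection_examples, and precision_recall_f1 are hypothetical helpers. The F1 computation treats "hallucinated" as the positive class.

```python
from dataclasses import dataclass

# ASSUMED column names; verify against the loaded dataset's column_names and adjust.
QUESTION_KEY = "Question"
GROUND_TRUTH_KEY = "Ground Truth"
HALLUCINATED_KEY = "Hallucinated Answer"


@dataclass
class DetectionExample:
    question: str
    answer: str
    is_hallucinated: bool  # True when the answer is the hallucinated one


def to_detection_examples(records):
    """Turn each record into two examples: one faithful answer and one hallucinated answer."""
    examples = []
    for rec in records:
        question = rec[QUESTION_KEY]
        examples.append(DetectionExample(question, rec[GROUND_TRUTH_KEY], False))
        examples.append(DetectionExample(question, rec[HALLUCINATED_KEY], True))
    return examples


def precision_recall_f1(labels, predictions):
    """Score binary predictions, treating 'hallucinated' (True) as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Assuming each configuration exposes a single "train" split, you could build examples with to_detection_examples(medhallu_labeled["train"]), since iterating a Dataset yields one dict per row. You would then collect your model's True/False judgment for each example and pass the labels and predictions to precision_recall_f1.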
License
This dataset and associated resources are distributed under the MIT License.
Citations
If you find MedHallu useful in your research, please consider citing our work:
@misc{pandit2025medhallucomprehensivebenchmarkdetecting,
title={MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models},
author={Shrey Pandit and Jiawei Xu and Junyuan Hong and Zhangyang Wang and Tianlong Chen and Kaidi Xu and Ying Ding},
year={2025},
eprint={2502.14302},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.14302},
}
Contact
For further information or inquiries about MedHallu, please reach out at [email protected]