How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
Abstract
In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk to their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; for these languages, we then demonstrate that hallucination-rate estimates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates in other languages. For the final rate estimation, we build a knowledge-intensive QA dataset covering 30 languages, with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between the length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.
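As a concrete illustration of the length-normalized rates mentioned above, here is a minimal sketch assuming token-level hallucination labels per response; the function name and data are illustrative, not taken from the paper.

```python
# Illustrative sketch (not from the paper): a length-normalized hallucination
# rate, assuming each response comes with per-token hallucination labels.

def hallucination_rate(responses):
    """responses: list of lists of booleans, True = token flagged as hallucinated."""
    total_tokens = sum(len(r) for r in responses)
    hallucinated = sum(sum(r) for r in responses)
    return hallucinated / total_tokens if total_tokens else 0.0

# Example: two responses of different lengths, one hallucinated token in six.
print(hallucination_rate([[True, False, False, False], [False, False]]))  # ~0.167
```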
Community
The paper presents the first effort towards understanding how much multilingual LLMs hallucinate "in the wild". To this end, we propose a novel framework for hallucination rate estimation that adjusts the number of detected hallucinations based on the detector's performance, resulting in more reliable rate estimates. We train a series of multilingual detection models and measure their precision and recall on our newly created MFAVA datasets across 30 languages. To estimate hallucination rates, we build a novel synthetic open-domain, knowledge-intensive QA dataset for which we collect answers from eleven open-source LLMs. Our findings indicate that smaller models and models that cover more languages hallucinate significantly more, and that model response length does not correlate with hallucination rate.
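To make the detector-performance adjustment more concrete, below is a minimal sketch of one way such a correction could look, assuming the detector's precision and recall have been measured on gold-annotated data. It is an illustration under those assumptions, not the paper's exact estimator.

```python
# Illustrative sketch (not the paper's exact estimator): correcting a raw
# detected-hallucination count with the detector's measured precision/recall.
#
#   expected true positives      ~ detected * precision
#   expected actual hallucinations ~ detected * precision / recall

def adjusted_hallucination_count(detected: int, precision: float, recall: float) -> float:
    """Estimate the true number of hallucinations from a noisy detector's count."""
    if recall == 0:
        raise ValueError("recall must be > 0 to correct the count")
    return detected * precision / recall

# Example: 120 detected spans, precision 0.8, recall 0.6
# -> roughly 160 actual hallucinations estimated.
print(adjusted_hallucination_count(120, precision=0.8, recall=0.6))  # 160.0
```

In this toy example, a low-recall detector leads the raw count to be revised upward, while low precision pushes it downward.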
This is an automated message from the Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models (2025)
- SelfCheckAgent: Zero-Resource Hallucination Detection in Generative Large Language Models (2025)
- Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization (2025)
- Can Your Uncertainty Scores Detect Hallucinated Entity? (2025)
- HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses (2025)
- Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation (2025)
- Post-training an LLM for RAG? Train on Self-Generated Demonstrations (2025)