Detecting Hallucinations in Authentic LLM-Human Interactions
By: Yujie Ren, Niklas Gruhlke, Anne Lauscher
Potential Business Impact:
Detects when AI makes things up in real conversations.
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed, either through deliberate hallucination induction or simulated interactions, rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.
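To make the "vanilla LLMs as hallucination detectors" idea concrete, here is a minimal sketch of an LLM-as-judge detector over (query, response) pairs. This is not the paper's evaluation protocol: the judge model name, prompt wording, and YES/NO label scheme are illustrative assumptions, and the OpenAI client is used only as a stand-in for any chat-completion API.

```python
# Minimal sketch (assumed setup, not the authors' method): ask a vanilla LLM
# to judge whether a response to a query contains a hallucination.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judge prompt; the benchmark's actual annotation guidelines differ.
JUDGE_PROMPT = (
    "You are given a user query and a model response.\n"
    "Decide whether the response contains a hallucination, i.e. content that is "
    "factually incorrect or unsupported.\n"
    "Answer with a single word: YES or NO.\n\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Hallucination:"
)


def detect_hallucination(query: str, response: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge LLM flags the response as hallucinated."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
        temperature=0,  # deterministic verdicts make evaluation more repeatable
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")


if __name__ == "__main__":
    # Toy example from the kind of Math & Number Problems domain the abstract mentions.
    flagged = detect_hallucination(
        query="What is 17 * 24?",
        response="17 * 24 = 418.",  # incorrect: the product is 408
    )
    print("hallucination detected:", flagged)
```

In practice such a judge would be run over each annotated query-response pair in the benchmark and scored against the human labels; the abstract's finding is that this kind of vanilla-LLM detector is promising but not yet sufficient in real-world scenarios.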
Similar Papers
HalluLens: LLM Hallucination Benchmark
Computation and Language
Measures how often AI makes up fake answers.
HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems
Computation and Language
Finds and reduces the things chatbots make up.
A comprehensive taxonomy of hallucinations in Large Language Models
Computation and Language
Categorizes the ways AI makes things up.