Hallucination Detection and Evaluation of Large Language Model
By: Chenggong Zhang, Haopeng Wang
Potential Business Impact:
Finds fake answers from smart computer programs.
Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To mitigate this weakness, we introduce segment-based retrieval, which improves detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
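The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes that "positive" means a response labeled as hallucinated, which the abstract does not state, and the function name `detection_metrics` and the sample labels are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DetectionMetrics:
    tpr: float       # true positive rate: TP / (TP + FN)
    tnr: float       # true negative rate: TN / (TN + FP)
    accuracy: float  # (TP + TN) / total


def detection_metrics(labels: Sequence[bool],
                      predictions: Sequence[bool]) -> DetectionMetrics:
    """Compute TPR, TNR, and accuracy for a binary hallucination detector.

    Assumption: True means "hallucinated" in both sequences.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)

    # Guard against a class being absent from the labels.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return DetectionMetrics(tpr=tpr, tnr=tnr, accuracy=accuracy)


if __name__ == "__main__":
    # Hypothetical toy data, not results from the paper.
    gold = [True, True, True, False, False]
    pred = [True, True, False, False, True]
    print(detection_metrics(gold, pred))
```

Reporting TPR and TNR alongside accuracy shows whether a detector's errors come mainly from missed hallucinations or from false alarms, which accuracy alone would hide.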
Similar Papers
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Computation and Language
Fixes AI mistakes that humans can't see.
A Survey of Multimodal Hallucination Evaluation and Detection
CV and Pattern Recognition
Fixes AI that makes up fake things.
Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
Computation and Language
Makes AI answer legal questions truthfully and accurately.