HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
By: Samir Abdaljalil, Hasan Kurban, Erchin Serpedin
Potential Business Impact:
Helps spot when AI makes things up in many languages.
Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
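The abstract describes the injection step only at a high level. As a rough, hypothetical illustration (not the paper's actual pipeline), the sketch below shows how an LLM could be prompted to corrupt a factual biographical sentence with one fine-grained hallucination type before handing the pair to human annotators. The prompt wording, the type definitions, and the `call_llm` stub are all assumptions made for this example.

```python
# Illustrative sketch of a hallucination-injection step (assumed, not the paper's code).
# `call_llm` is a hypothetical stand-in for whatever LLM API the authors actually use.

HALLUCINATION_TYPES = {
    "entity": "Replace exactly one named entity (person, place, date, organization) "
              "with a plausible but incorrect one.",
    "relation": "Change the relation between two entities (e.g., 'born in' -> 'died in') "
                "while keeping the entities themselves unchanged.",
    "sentence": "Rewrite or extend the sentence so it asserts something unsupported "
                "by the original fact.",
}

def build_injection_prompt(factual_sentence: str, hallu_type: str, language: str) -> str:
    """Compose an instruction asking the LLM to corrupt a factual sentence."""
    instruction = HALLUCINATION_TYPES[hallu_type]
    return (
        f"The following sentence in {language} is factually correct:\n"
        f"{factual_sentence}\n\n"
        f"{instruction}\n"
        "Return only the modified sentence."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a real LLM API.
    return "<hallucinated sentence returned by the model>"

if __name__ == "__main__":
    fact = "Marie Curie was born in Warsaw in 1867."
    prompt = build_injection_prompt(fact, "entity", "English")
    corrupted = call_llm(prompt)
    # Each (fact, corrupted, type, language) tuple would then go to human annotators.
    print(corrupted)
```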
Similar Papers
HalluLens: LLM Hallucination Benchmark
Computation and Language
Measures how often AI makes up fake answers.
AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Computation and Language
Checks if AI makes up facts in Arabic.
Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
Computer Vision and Pattern Recognition
Teaches AI to spot and fix fake medical image descriptions.