MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
By: Bowen Dong, Minheng Ni, Zitong Huang, and more
Potential Business Impact:
Fixes AI mistakes when it "sees" and "thinks."
Multimodal hallucination restricts the correctness of multimodal large language models (MLLMs). However, such hallucinations arise from diverse sources, and existing benchmarks fail to distinguish perception-induced hallucinations from reasoning-induced ones, which hinders the diagnosis of multimodal reasoning failures in MLLMs. To address this, we propose the MIRAGE benchmark, which isolates reasoning hallucinations by constructing questions where MLLMs perceive the input images correctly yet reasoning errors persist. MIRAGE introduces multi-granular evaluation metrics for hallucination quantification: accuracy, factuality, and an LLM-judged hallucination score. Our analysis reveals that (1) model scale, data scale, and training stages significantly affect the degree of logical, fabrication, and factual hallucinations; (2) current MLLMs show no effective improvement on spatial hallucinations caused by misinterpreted spatial relationships, indicating limited visual reasoning capabilities; and (3) question types correlate with distinct hallucination patterns, highlighting targeted challenges and potential mitigation strategies. To address these challenges, we propose a mitigation method that combines curriculum reinforcement fine-tuning, which encourages models to generate logic-consistent reasoning chains by stepwise reducing learning difficulty, with collaborative hint inference, which reduces reasoning complexity. This method establishes a baseline on MIRAGE and reduces the logical hallucinations of the original base models.
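To make the three metrics concrete, here is a minimal Python sketch of how such multi-granular scoring could be computed. The Sample fields, the naive substring fact-matcher, and the judge prompt are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of multi-granular hallucination scoring in the spirit of
# the benchmark's three metrics (accuracy, factuality, LLM hallucination
# score). Field names and the aggregation logic are assumptions.
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    gold_answer: str        # reference final answer
    gold_facts: list[str]   # atomic facts the reasoning chain should respect
    model_answer: str
    model_chain: str        # the model's reasoning chain

def accuracy(samples: list[Sample]) -> float:
    """Fraction of samples whose final answer matches the reference."""
    return sum(s.model_answer.strip().lower() == s.gold_answer.strip().lower()
               for s in samples) / len(samples)

def factuality(sample: Sample) -> float:
    """Fraction of reference facts grounded in the reasoning chain
    (naive substring check; a real evaluator would use a stronger matcher)."""
    hits = sum(f.lower() in sample.model_chain.lower() for f in sample.gold_facts)
    return hits / max(len(sample.gold_facts), 1)

def llm_hallucination_score(sample: Sample, judge) -> float:
    """Ask an LLM judge to rate fabricated or illogical steps in [0, 1].
    `judge` is any callable mapping a prompt string to a float."""
    prompt = (
        "Rate from 0 (no hallucination) to 1 (severe hallucination) how much "
        f"this reasoning chain fabricates or contradicts facts:\n{sample.model_chain}"
    )
    return judge(prompt)
```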
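The mitigation side can be sketched similarly. Below is a hedged outline of an easy-to-hard curriculum for reinforcement fine-tuning and a hint-augmented inference step; `difficulty`, `rl_update`, `model`, and `helper` are assumed interfaces, and the cumulative-tier schedule is one common curriculum choice, not necessarily the paper's.

```python
# Hedged sketch of the two mitigation ideas described above: a curriculum
# that feeds easier reasoning problems first during RL fine-tuning, and
# hint-augmented inference that lowers reasoning complexity at test time.

def curriculum_rl_finetune(model, dataset, difficulty, rl_update, stages=3):
    """Split data into difficulty tiers and fine-tune easy-to-hard, so the
    model learns logic-consistent chains before facing harder reasoning."""
    ordered = sorted(dataset, key=difficulty)
    tier = len(ordered) // stages
    for stage in range(stages):
        batch = ordered[: tier * (stage + 1)]  # cumulative: easy tiers stay in
        for sample in batch:
            rl_update(model, sample)  # e.g. a policy-gradient step with a
                                      # logic-consistency reward (assumption)
    return model

def collaborative_hint_inference(model, question, helper):
    """Ask a helper model for perception-level hints (objects, relations),
    then condition the reasoner on them to shrink the reasoning burden."""
    hints = helper(f"List the key visual facts needed to answer: {question}")
    return model(f"Hints: {hints}\nQuestion: {question}\nReason step by step.")
```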
Similar Papers
Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective
Artificial Intelligence
Finds and fixes when smart computers make up answers.
Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection
Computation and Language
Makes AI less likely to make up facts.
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
CV and Pattern Recognition
Makes AI less likely to make up wrong video answers.