Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
By: Weihang Wang, Xinhao Li, Ziyue Wang, and more
Potential Business Impact:
Fixes AI mistakes when describing pictures.
Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. Since the visual encoder is the primary component for accurately interpreting visual information, its choice is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill distinct inductive biases, which in turn lead to distinct hallucination behaviors. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the fine-grained hallucination types our hypothesis describes. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark of approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm that encoders exhibit unique hallucination characteristics. Building on these insights, and on the observation that simple feature fusion is suboptimal, we propose VisionWeaver, a novel Context-Aware Routing Network. It uses global visual features to generate routing signals and dynamically aggregates visual features from multiple specialized experts. Comprehensive experiments confirm that VisionWeaver significantly reduces hallucinations and improves overall model performance.
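To make the routing idea concrete, here is a minimal sketch of a context-aware routing module in the spirit of VisionWeaver: a gating MLP maps a global visual feature to per-expert weights, and the patch features from several visual-encoder "experts" are projected to a shared width and combined by those weights. The class name, dimensions, encoder choices, and the assumption that all experts emit the same number of patch tokens are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a context-aware routing network over multiple
# visual encoders. All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareRouter(nn.Module):
    def __init__(self, global_dim: int, expert_dims: list[int], out_dim: int):
        super().__init__()
        # Project each expert's patch features into a shared output space.
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in expert_dims)
        # Small MLP mapping the global visual feature to per-expert logits.
        self.gate = nn.Sequential(
            nn.Linear(global_dim, global_dim // 2),
            nn.GELU(),
            nn.Linear(global_dim // 2, len(expert_dims)),
        )

    def forward(self, global_feat: torch.Tensor,
                expert_feats: list[torch.Tensor]) -> torch.Tensor:
        # global_feat: (B, global_dim), e.g. a pooled/CLS token from a reference encoder
        # expert_feats: list of (B, num_patches, expert_dims[i]) tensors
        weights = F.softmax(self.gate(global_feat), dim=-1)           # (B, N_experts)
        projected = [p(f) for p, f in zip(self.proj, expert_feats)]   # each (B, P, out_dim)
        stacked = torch.stack(projected, dim=1)                       # (B, N, P, out_dim)
        # Weighted sum over experts, conditioned on the global visual context.
        fused = (weights[:, :, None, None] * stacked).sum(dim=1)      # (B, P, out_dim)
        return fused


# Usage example (shapes only; the experts could be e.g. CLIP, DINOv2, SigLIP backbones):
router = ContextAwareRouter(global_dim=1024, expert_dims=[1024, 768, 1152], out_dim=4096)
g = torch.randn(2, 1024)
feats = [torch.randn(2, 576, d) for d in (1024, 768, 1152)]
fused_tokens = router(g, feats)   # (2, 576, 4096), passed to the LLM as visual tokens
```

The key design point this sketch tries to capture is that the fusion weights are not fixed: they are recomputed per image from the global feature, so different images can lean on different encoders.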
Similar Papers
A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models
CV and Pattern Recognition
Fixes AI mistakes when it sees and talks.
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
CV and Pattern Recognition
Fixes AI that makes up answers when it sees pictures.
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
CV and Pattern Recognition
Stops AI from making up things it sees.