VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
By: Zihu Wang, Boxun Xu, Yuxuan Xia, and more
Potential Business Impact:
Makes AI see pictures better, with fewer mistakes.
Large vision-language models (LVLMs) exhibit an impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the vision encoder's attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder's attention maps into the language model's mid-layers and adaptively steers tokens that fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.
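To make the core idea concrete, here is a minimal sketch of what "injecting the vision encoder's attention into a mid-layer and steering only diffuse tokens" could look like. This is not the paper's implementation: the function name, the blending coefficient alpha, and the concentration threshold are illustrative assumptions, and real LVLMs operate on per-head attention inside specific decoder layers rather than a single flat map.

```python
import torch

def inject_encoder_attention(lm_attn, enc_attn, alpha=0.5, concentration_tau=0.5):
    """Hypothetical sketch of attention-guided steering.

    lm_attn:  (num_text_tokens, num_visual_tokens) LM attention over image patches
    enc_attn: (num_visual_tokens,) vision encoder's attention over the same patches
    """
    # Measure how concentrated each token's visual attention is
    # (maximum probability mass placed on any single patch).
    concentration = lm_attn.max(dim=-1).values            # (num_text_tokens,)
    needs_steering = concentration < concentration_tau    # diffuse rows get steered

    # Blend in the encoder's (more concentrated) map and renormalize.
    steered = (1 - alpha) * lm_attn + alpha * enc_attn.unsqueeze(0)
    steered = steered / steered.sum(dim=-1, keepdim=True)

    # Replace only the diffuse rows; already-concentrated rows are left untouched.
    return torch.where(needs_steering.unsqueeze(-1), steered, lm_attn)


# Toy usage with random weights standing in for real model attentions.
lm_attn = torch.softmax(torch.randn(4, 16), dim=-1)   # 4 text tokens, 16 image patches
enc_attn = torch.softmax(torch.randn(16), dim=-1)     # encoder attention over patches
print(inject_encoder_attention(lm_attn, enc_attn).shape)  # torch.Size([4, 16])
```

In the paper's setting, the steering would be applied at inference time inside the language model's middle layers, where the abstract reports that vision-text conflicts peak.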
Similar Papers
Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
CV and Pattern Recognition
Fixes AI mistakes when describing pictures.
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
CV and Pattern Recognition
Fixes AI mistakes when describing pictures.
Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models
CV and Pattern Recognition
Makes AI see better, not just guess words.