Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
By: Lexiang Tang, Xianwei Zhuang, Bang Yang, and more
Potential Business Impact:
Fixes AI's mistakes when it describes pictures.
Large vision-language models (LVLMs) have demonstrated impressive capabilities across diverse multimodal tasks, yet they remain highly susceptible to visual hallucinations (VH), often producing confident but inaccurate descriptions of visual content. Building on the insight that not all tokens and attention heads contribute equally to VH mitigation, we introduce VisFlow, a lightweight and training-free framework that alleviates hallucinations by directly modulating attention patterns during inference. To address two primary challenges of VH, namely insufficient visual attention and the dominance of language priors, we identify three problematic attention behaviors in LVLMs: (1) disproportionate allocation of attention to uninformative or trailing visual tokens, (2) over-dependence on the previously generated token, and (3) excessive fixation on system prompts that hinders multimodal integration. To overcome these issues, VisFlow introduces a dual-level Attention Intervention, consisting of Token-level Attention Intervention (TAI), which reinforces attention to salient visual regions, and Head-level Attention Intervention (HAI), which suppresses undue focus on system prompts and adjacent text tokens. Together, these interventions strengthen visual alignment while reducing linguistic bias. Extensive experiments across diverse models and benchmarks demonstrate that VisFlow effectively mitigates hallucinations with minimal computational overhead.
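The two interventions described in the abstract can be illustrated as simple rescalings of an attention distribution. The sketch below is a minimal, conceptual toy in NumPy, not the authors' implementation: the function names, the boost factor `alpha`, the damping factor `beta`, and the choice of which heads and positions to modify are all illustrative assumptions. TAI increases attention mass on visual-token positions; HAI damps attention to system-prompt and adjacent-text positions for selected heads. Both renormalize so each head's attention still sums to one.

```python
import numpy as np

def token_level_intervention(attn, visual_idx, alpha=2.0):
    """TAI sketch: boost attention on visual-token positions, then renormalize.

    attn: (num_heads, seq_len) attention row for the current query token.
    alpha: hypothetical boost factor (illustrative, not from the paper).
    """
    out = attn.copy()
    out[:, visual_idx] *= alpha
    return out / out.sum(axis=-1, keepdims=True)

def head_level_intervention(attn, head_idx, suppress_idx, beta=0.5):
    """HAI sketch: for selected heads, damp attention to system-prompt and
    adjacent-text positions, then renormalize those heads' rows.

    beta: hypothetical damping factor (illustrative, not from the paper).
    """
    out = attn.copy()
    out[np.ix_(head_idx, suppress_idx)] *= beta
    return out / out.sum(axis=-1, keepdims=True)

# Toy example: 2 heads over 6 positions.
# Positions 0-1: system prompt, 2-4: visual tokens, 5: previously generated text.
attn = np.full((2, 6), 1.0 / 6.0)          # uniform attention to start
attn = token_level_intervention(attn, visual_idx=[2, 3, 4])
attn = head_level_intervention(attn, head_idx=[0], suppress_idx=[0, 1, 5])
```

After both steps, each head's row remains a valid distribution, with relatively more mass on the visual positions and (for the intervened head) less on the system prompt and the previous text token.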
Similar Papers
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
CV and Pattern Recognition
Makes AI describe pictures without making things up.
Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
CV and Pattern Recognition
Stops AI from making up fake details about pictures.
Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Multiagent Systems
Fixes AI mistakes when talking about pictures.