Context-Aware Decoding for Faithful Vision-Language Generation
By: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
Potential Business Impact:
Stops AI from making things up about pictures.
Hallucinations, i.e., responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
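The abstract names two technical ingredients: a Logit Lens probe of layer-wise next-token distributions (to measure how early a model commits to its final token) and a CEI step that blends the last input token's hidden state into decoding as a grounding signal. The sketch below is a minimal, self-contained illustration of both on random stand-in tensors; the layer count, dimensions, blending rule, and `alpha` weight are assumptions for illustration only, not the paper's actual formulation.

```python
# Minimal sketch (not the authors' code): (1) a Logit Lens probe of layer-wise
# next-token distributions, (2) a hypothetical Context Embedding Injection step
# that blends the last input token's hidden state into the decoded hidden state.
import torch

torch.manual_seed(0)
num_layers, hidden_dim, vocab_size = 32, 4096, 32000  # assumed sizes

# Stand-ins for a forward pass: the current token's hidden state at each
# decoder layer, plus the hidden state of the last *input* token
# (the "context embedding" used as a grounding signal).
layer_states = [torch.randn(hidden_dim) for _ in range(num_layers + 1)]
context_embedding = torch.randn(hidden_dim)
unembed = torch.randn(vocab_size, hidden_dim)  # LM head / unembedding matrix

def logit_lens(h: torch.Tensor) -> torch.Tensor:
    """Project an intermediate hidden state onto the vocabulary (Logit Lens)."""
    return torch.softmax(unembed @ h, dim=-1)

# Commitment depth: how early the layer-wise distributions put probability
# mass on the token that the final layer ultimately selects.
final_token = int(torch.argmax(logit_lens(layer_states[-1])))
for layer, h in enumerate(layer_states):
    p_final = logit_lens(h)[final_token].item()
    print(f"layer {layer:2d}: P(final token) = {p_final:.4f}")

# Hypothetical CEI step: blend the context embedding into the final hidden
# state with weight alpha before projecting to logits. A fixed alpha stands in
# for the paper's injection rule; the dynamic variant presumably adapts this
# strength per token.
alpha = 0.2  # assumption: injection strength
grounded = (1 - alpha) * layer_states[-1] + alpha * context_embedding
print("top token after injection:", int(torch.argmax(logit_lens(grounded))))
```

In practice the layer states would come from an LVLM's forward pass (e.g., via its hidden-state outputs) rather than random tensors, and tokens whose layer-wise probability on the final candidate rises late would be the ones flagged as hallucination-prone by the commitment-depth analysis.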
Similar Papers
Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models
CV and Pattern Recognition
Fixes AI mistakes when describing pictures.
VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
CV and Pattern Recognition
Helps AI see pictures better and make fewer mistakes.
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models
CV and Pattern Recognition
Stops computers from making things up about pictures.