Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
By: Sohee Kim, Soohyun Ryu, Joonhyung Park, and more
Potential Business Impact:
Stops vision-language AI from assuming that things mentioned in a question are actually present in the image.
Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our findings reveal that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its predictions, we propose a method that refines the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text inputs and demonstrate its generality across various LVLMs.
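As a rough illustration of how such a detector might be wired up, the sketch below hooks one decoder layer's FFN activations, reads them out at chosen token positions, and thresholds the mean activation over a candidate neuron set. The layer path, `va_neuron_ids`, and the threshold are hypothetical placeholders, not values or code from the paper, which identifies VA neurons empirically from activation statistics.

```python
# Minimal sketch (not the authors' released code): probing FFN activations to
# flag text tokens that may lack visual grounding. All indices/thresholds here
# are placeholders for illustration only.
import torch

def collect_ffn_activations(model, layer_idx):
    """Register a hook on one decoder layer's FFN non-linearity to capture its activations."""
    cache = {}

    def hook(module, inputs, output):
        # output: (batch, seq_len, intermediate_size) post-activation values
        # of this layer's FFN; each channel corresponds to one FFN neuron.
        cache["acts"] = output.detach()

    # Path assumes a LLaMA-style decoder backbone exposed as
    # model.model.layers[i].mlp.act_fn; adjust for other LVLM architectures.
    handle = model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(hook)
    return cache, handle

def visual_absence_scores(acts, token_positions, va_neuron_ids):
    """Average the activations of the (assumed) VA neurons at each token position."""
    # acts: (batch, seq_len, intermediate_size); a batch of 1 is assumed here.
    selected = acts[0, token_positions][:, va_neuron_ids]  # (num_tokens, num_va_neurons)
    return selected.mean(dim=-1)  # higher score => token more likely visually absent

def classify_tokens(scores, threshold=0.5):
    """Binary decision: True means the token is predicted to lack visual grounding."""
    return scores > threshold
```

In use, one would run the LVLM forward pass on the image-plus-text input, score the text-token positions with `visual_absence_scores`, and let the downstream generation step reinterpret the prompt or substitute tokens flagged by `classify_tokens`.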
Similar Papers
Direct Visual Grounding by Directing Attention of Visual Tokens
CV and Pattern Recognition
Makes AI better at answering questions about pictures.
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
CV and Pattern Recognition
Finds important image parts for AI understanding.