Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
By: Pengfei Wang, Guohai Xu, Weinong Wang, and more
Potential Business Impact:
Checks if AI truly sees pictures, not just guesses.
Recent advancements have enhanced the capability of Multimodal Large Language Models (MLLMs) to comprehend multi-image information. However, existing benchmarks primarily evaluate answer correctness, overlooking whether models genuinely comprehend the visual input. To address this, we define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. Through our analysis, we decouple the visual and textual modalities within the causal attention module, revealing that attention distribution increasingly converges on the image associated with the correct answer as the network layers deepen. This insight leads to the introduction of a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. Attention accuracy directly evaluates the model's visual understanding via internal mechanisms, remaining robust to positional biases for more reliable assessments. Furthermore, we extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios, underscoring its versatility and generalizability.
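The abstract does not give the exact computation of attention accuracy, but a minimal sketch is possible under the assumption that it measures how often the candidate image receiving the most attention from the answer tokens is the image tied to the correct answer. The function and argument names below (attention_accuracy, attn_to_images, correct_image_idx), the layer/head aggregation, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the exact token grouping, layer choice, and
# aggregation used in the paper are assumptions here.
import numpy as np

def attention_accuracy(attn_to_images, correct_image_idx):
    """
    attn_to_images: (num_examples, num_images) array; for each example, the
        attention mass the answer tokens place on each candidate image's tokens
        (e.g., summed over heads and image tokens at a chosen layer).
    correct_image_idx: (num_examples,) array with the index of the image
        associated with the correct answer.
    Returns the fraction of examples where the most-attended image is the
    correct one (scale-agnostic: only the argmax per example matters).
    """
    predicted = np.argmax(attn_to_images, axis=1)  # image attended to most
    return float(np.mean(predicted == np.asarray(correct_image_idx)))

# Toy usage: 3 examples, 4 candidate images each.
attn = np.array([
    [0.10, 0.60, 0.20, 0.10],   # attends most to image 1
    [0.40, 0.25, 0.20, 0.15],   # attends most to image 0
    [0.05, 0.15, 0.30, 0.50],   # attends most to image 3
])
labels = np.array([1, 2, 3])
print(attention_accuracy(attn, labels))  # 2/3: correct image attended in 2 of 3 cases
```

Because the score depends only on which image wins the per-example argmax, it is unaffected by the overall scale of attention values, which is consistent with the abstract's description of the metric as scale-agnostic.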
Similar Papers
Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
CV and Pattern Recognition
Helps computers truly understand pictures and words together.
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
CV and Pattern Recognition
Makes AI describe pictures without making things up.
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
CV and Pattern Recognition
Makes AI understand pictures and words better together.