Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis
By: Xinhan Zheng, Huyu Wu, Xueting Wang, and more
Potential Business Impact:
Explains why AI favors words over pictures, a step toward fixing that bias.
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
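The analysis pipeline the abstract describes (extract key vectors, embed them with t-SNE, compare modalities with Jensen-Shannon divergence) can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: synthetic Gaussian keys stand in for vectors that would in practice be captured with forward hooks on the k_proj projections of LLaVA or Qwen2.5-VL, and the head dimension, histogram binning, and js_divergence helper are our assumptions.

```python
# Sketch: quantify visual-vs-text key separation with per-dimension
# Jensen-Shannon divergence and visualize it with t-SNE. Synthetic
# Gaussians stand in for keys that would be hooked out of the model.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
d = 128  # assumed per-head key dimension

# Stand-ins: the offset mean mimics visual keys being OOD w.r.t. text keys.
text_keys = rng.normal(0.0, 1.0, size=(2000, d))
visual_keys = rng.normal(1.5, 1.0, size=(2000, d))

def js_divergence(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    """Mean per-dimension JS divergence between two sets of key vectors."""
    divs = []
    for j in range(a.shape[1]):
        lo = min(a[:, j].min(), b[:, j].min())
        hi = max(a[:, j].max(), b[:, j].max())
        p, _ = np.histogram(a[:, j], bins=bins, range=(lo, hi), density=True)
        q, _ = np.histogram(b[:, j], bins=bins, range=(lo, hi), density=True)
        # scipy's jensenshannon returns the JS *distance*; square for divergence.
        divs.append(jensenshannon(p, q, base=2) ** 2)
    return float(np.mean(divs))

# Inter-modal divergence vs. an intra-modal baseline (text keys split in half).
inter = js_divergence(visual_keys, text_keys)
intra = js_divergence(text_keys[:1000], text_keys[1000:])
print(f"inter-modal JSD: {inter:.4f}  intra-modal JSD: {intra:.4f}")

# Qualitative view: t-SNE embedding of both key sets; scatter-plot the
# rows of `emb` colored by modality to see the two subspaces separate.
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
    np.vstack([text_keys, visual_keys])
)
```

With the synthetic offset above, the inter-modal JSD comes out far larger than the intra-modal baseline, mirroring the paper's reported orders-of-magnitude gap; on real models, the same comparison would be run on keys collected layer by layer.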
Similar Papers
See What You Are Told: Visual Attention Sink in Large Multimodal Models
CV and Pattern Recognition
Makes AI better at looking at pictures.
When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Computation and Language
Makes AI use all senses, not just reading.
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
CV and Pattern Recognition
Checks if AI truly sees pictures, not just guesses.