V-FAT: Benchmarking Visual Fidelity Against Text-bias
By: Ziteng Wang, Yujie He, Guanliang Li, and more
Potential Business Impact:
Helps AI truly see, not just guess words.
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel on existing benchmarks, they suffer significant visual collapse under high linguistic dominance.
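The abstract does not give the VRS formula, so the scoring rule below is a hypothetical stand-in (the record keys `pred`, `visual_gt`, and `text_answer` are illustrative names, not from the paper). It is a minimal sketch of the stated idea only: an answer grounded in the image earns credit, while an answer that merely echoes the misleading text, a "lucky" linguistic guess, is actively penalized rather than scored as an ordinary error.

```python
def visual_robustness_score(records):
    """Sketch of a VRS-style metric (hypothetical weighting, not the paper's).

    records: list of dicts with keys
      'pred'        - the model's answer
      'visual_gt'   - the answer supported by the image
      'text_answer' - the answer suggested by misleading text, or None
    """
    score = 0.0
    for r in records:
        if r["pred"] == r["visual_gt"]:
            score += 1.0   # visually faithful answer: full credit
        elif r["text_answer"] is not None and r["pred"] == r["text_answer"]:
            score -= 1.0   # followed the linguistic prior: penalized
        # any other wrong answer contributes 0
    return max(score, 0.0) / len(records)


# Three instances: two visually faithful answers, one text-following answer.
recs = [
    {"pred": "zebra", "visual_gt": "zebra", "text_answer": "horse"},
    {"pred": "zebra", "visual_gt": "zebra", "text_answer": "horse"},
    {"pred": "horse", "visual_gt": "zebra", "text_answer": "horse"},
]
print(round(visual_robustness_score(recs), 3))  # → 0.333
```

Under plain accuracy the text-following answer and any other error look identical (both 2/3); the asymmetric penalty is what lets a metric of this kind distinguish genuine visual grounding from models that happen to guess along with the text.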
Similar Papers
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
CV and Pattern Recognition
Computers trust words more than pictures.
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures better, even blurry ones.
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
Computation and Language
Checks if AI explanations for X-rays are truthful.