How Do Vision-Language Models Process Conflicting Information Across Modalities?

Published: July 2, 2025 | arXiv ID: 2507.01790v1

By: Tianze Hua, Tian Yun, Ellie Pavlick

Potential Business Impact:

AI learns which information to trust when confused.

Business Areas:

Computer Vision Hardware, Software

AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
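The behavioral part of the setup described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the `ConflictExample` class, the `Model` callable interface, the `report_rates` helper, and the toy `always_image` model are all hypothetical names introduced here. Only the caption template and the two prompts are taken from the abstract. A real experiment would replace the toy model with a call to an actual vision-language model and would pass real images rather than labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Prompts quoted from the abstract; the dictionary keys are our own labels.
PROMPTS: Dict[str, str] = {
    "image": "What is in the image?",
    "caption": "What does the caption say?",
}


@dataclass(frozen=True)
class ConflictExample:
    """One inconsistent input: the image shows one thing, the caption claims another."""
    image_label: str    # what the image depicts, e.g. "dog"
    caption_label: str  # what the caption claims, e.g. "cat"

    @property
    def caption(self) -> str:
        # Caption template from the abstract's example.
        return f"A photo of a {self.caption_label}"


# Hypothetical model interface (an assumption): (example, prompt) -> predicted label.
Model = Callable[[ConflictExample, str], str]


def make_conflicts(labels: List[str]) -> List[ConflictExample]:
    """Pair every image label with every different caption label."""
    return [ConflictExample(i, c) for i in labels for c in labels if i != c]


def report_rates(model: Model, examples: List[ConflictExample], target: str) -> Dict[str, float]:
    """Fraction of answers matching the image label, the caption label, or neither."""
    if not examples:
        raise ValueError("examples must be non-empty")
    counts = {"image": 0, "caption": 0, "other": 0}
    prompt = PROMPTS[target]
    for ex in examples:
        answer = model(ex, prompt).strip().lower()
        if answer == ex.image_label:
            counts["image"] += 1
        elif answer == ex.caption_label:
            counts["caption"] += 1
        else:
            counts["other"] += 1
    return {key: value / len(examples) for key, value in counts.items()}


def always_image(ex: ConflictExample, prompt: str) -> str:
    """Toy stand-in for a model that reports the image regardless of the instruction."""
    return ex.image_label


examples = make_conflicts(["dog", "cat", "bird"])
rates_when_asked_caption = report_rates(always_image, examples, target="caption")
```

Under this sketch, a model that follows the instruction scores 1.0 on the requested modality for both prompts. A modality bias of the kind the abstract describes shows up as a high rate for the unrequested modality, as with the toy `always_image` model when it is asked about the caption.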

Challenges in Understanding Modality Conflict in Vision-Language Models

Machine Learning (CS)

Helps computers understand when pictures and words disagree.

2 Sep 2025 1

90%

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

CV and Pattern Recognition

Fixes AI mistakes by showing what it sees.

18 Jul 2025 1

90%

Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization

Sound

AI learns to trust sound over pictures.

16 May 2025 2

Country of Origin

🇺🇸 United States

Repos / Data Links

github.com

Page Count

34 pages
