When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
By: Cheng Wang, Gelei Deng, Xianglin Yang, and more
Potential Business Impact:
AI ignores sounds when text disagrees.
Large Audio-Language Models (LALMs) extend language models with audio perception capabilities, enabling them to process and understand multimodal inputs that combine audio and text. However, their behavior when the two modalities carry conflicting information remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when the modalities disagree, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency causes substantial performance degradation on audio-centric tasks and raises serious reliability concerns for real-world applications. We further investigate the factors that influence text bias, explore mitigation strategies based on supervised fine-tuning, and analyze model confidence patterns, which reveal persistent overconfidence even in the face of contradictory inputs. These findings underscore the need for better modality balance during training and more sophisticated fusion mechanisms to improve robustness on conflicting multimodal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
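To make the evaluation setup concrete, here is a minimal sketch of how one might score modality preference on conflicting audio-text pairs. This is not the paper's actual harness: `ConflictItem`, `audio_preference_rate`, and the toy model below are illustrative assumptions, not MCR-BENCH's API. The idea is that each item pairs an audio clip with a contradicting text claim, and the metric counts how often the model's answer sides with the audio evidence.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConflictItem:
    """A single audio-text conflict (hypothetical schema, not MCR-BENCH's)."""
    audio_path: str   # audio clip, e.g. a recording of a dog barking
    text_claim: str   # accompanying text that contradicts the audio
    audio_answer: str # answer consistent with the audio evidence
    text_answer: str  # answer consistent with the text claim
    question: str     # question posed to the model

def audio_preference_rate(
    items: list[ConflictItem],
    query_model: Callable[[str, str, str], str],
) -> float:
    """Fraction of conflicting pairs where the model's answer follows
    the audio rather than the contradicting text."""
    follows_audio = 0
    scored = 0
    for item in items:
        reply = query_model(item.audio_path, item.text_claim, item.question).lower()
        has_audio = item.audio_answer.lower() in reply
        has_text = item.text_answer.lower() in reply
        if has_audio == has_text:
            continue  # ambiguous or unparseable reply: skip rather than guess
        scored += 1
        follows_audio += has_audio
    return follows_audio / max(scored, 1)

if __name__ == "__main__":
    # Toy stand-in for an LALM that always trusts the text, illustrating
    # the text-bias failure mode the paper reports.
    text_biased = lambda audio, text, q: "Based on the description, it is a cat."
    demo = [ConflictItem("bark.wav", "A cat is meowing.", "dog", "cat",
                         "What animal do you hear?")]
    print(audio_preference_rate(demo, text_biased))  # 0.0 -> pure text bias
```

A rate near 1.0 would indicate the model grounds its answers in the audio; the paper's finding of text bias corresponds to rates well below that, as the toy text-trusting model above demonstrates.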
Similar Papers
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Computation and Language
Fixes computer understanding of mixed-up information.
MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Computation and Language
Voice changes how computers give medical advice.
Not in Sync: Unveiling Temporal Bias in Audio Chat Models
Computation and Language
Fixes AI's timing mistakes in understanding sounds.