Not in Sync: Unveiling Temporal Bias in Audio Chat Models
By: Jiayu Yao, Shenghua Liu, Yiwei Wang, and more
Potential Business Impact:
Reveals and measures AI's timing mistakes in understanding sounds.
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length, even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), which measures systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
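The abstract does not spell out how the Temporal Bias Index is computed. As a minimal sketch, assuming TBI is the mean signed error between predicted and ground-truth timestamps (a hypothetical definition; the paper's exact formula may differ), it could look like this:

```python
import numpy as np

def temporal_bias_index(predicted, ground_truth):
    """Mean signed error between predicted and true event timestamps (seconds).

    Positive values mean the model systematically answers later than the
    event actually occurs; negative values mean earlier. This is a
    hypothetical formulation, not the paper's confirmed definition.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.mean(predicted - ground_truth))

# Example: a model that consistently answers about 2 seconds early.
pred = [10.1, 34.0, 57.8]   # model's answers to "At which second ...?"
true = [12.0, 36.5, 59.9]   # annotated ground-truth timestamps
print(temporal_bias_index(pred, true))  # ~ -2.17 -> systematic early bias
```

A signed mean (rather than absolute error) is what distinguishes a consistent directional bias from mere imprecision: random early/late errors cancel toward zero, while the systematic drift the paper describes does not.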
Similar Papers
TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models
Sound
Helps computers understand exact moments in audio.
When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Computation and Language
AI ignores sounds when text disagrees.
MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Computation and Language
Voice changes how computers give medical advice.