Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
By: Jingyi Chen, Zhimeng Guo, Jiyun Chun, and more
Potential Business Impact:
Computers hear words, not feelings in voices.
Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict "neutral" when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely "transcribe" rather than "listen," relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.
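To make the cue-alignment and cue-conflict setup concrete, here is a minimal sketch of how such an evaluation could be scored. The item fields, the `predict_emotion` stub, and the scoring code are illustrative assumptions, not the LISTEN benchmark's actual interface or data format.

```python
# Sketch of a cue-alignment / cue-conflict scoring loop (hypothetical interface,
# not the LISTEN benchmark's real API).
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ListenItem:
    audio_path: str        # spoken utterance
    lexical_emotion: str   # emotion implied by the transcript alone
    acoustic_emotion: str  # emotion conveyed by prosody / voice quality

    @property
    def condition(self) -> str:
        # "aligned" when both cues point to the same emotion, "conflict" otherwise
        return "aligned" if self.lexical_emotion == self.acoustic_emotion else "conflict"


def cue_reliance(items: Iterable[ListenItem],
                 predict_emotion: Callable[[str], str]) -> dict:
    """Count how often predictions follow the lexical vs. the acoustic cue."""
    stats = {"aligned_correct": 0, "aligned_total": 0,
             "conflict_lexical": 0, "conflict_acoustic": 0, "conflict_total": 0}
    for item in items:
        pred = predict_emotion(item.audio_path)
        if item.condition == "aligned":
            stats["aligned_total"] += 1
            stats["aligned_correct"] += pred == item.lexical_emotion
        else:
            stats["conflict_total"] += 1
            stats["conflict_lexical"] += pred == item.lexical_emotion
            stats["conflict_acoustic"] += pred == item.acoustic_emotion
    return stats


if __name__ == "__main__":
    # Toy model that always answers "neutral" -- mirrors the failure mode
    # described above when lexical cues are neutral or absent.
    always_neutral = lambda _path: "neutral"
    demo = [
        ListenItem("happy_words_happy_voice.wav", "happy", "happy"),
        ListenItem("neutral_words_angry_voice.wav", "neutral", "angry"),
    ]
    print(cue_reliance(demo, always_neutral))
```

Under this kind of scoring, a model that "transcribes rather than listens" would show high agreement with the lexical label and near-zero agreement with the acoustic label on conflict items.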
Similar Papers
On the Contribution of Lexical Features to Speech Emotion Recognition
Audio and Speech Processing
Lets computers understand feelings from spoken words.
MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Computation and Language
Voice changes how computers give medical advice.
Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs
Computation and Language
Helps computers understand feelings in voices.