Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech
By: Pedro Corrêa, João Lima, Victor Moreno, and more
Potential Business Impact:
Shows that current speech AI models judge emotion mostly from the words, not the tone of voice.
Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
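As a rough illustration of the evaluation idea, the sketch below (not the authors' released code) shows one way to measure whether a model's emotion predictions track the text label or the acoustic label on incongruent samples. The sample fields and the `predict_emotion` callable are hypothetical placeholders standing in for a real SLM inference call.

```python
# Minimal sketch, assuming each incongruent sample carries two reference
# labels: the emotion implied by the words and the emotion expressed by
# the voice. Agreement with each label indicates which modality the model
# relies on. `predict_emotion(audio_path, transcript)` is a placeholder.

from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class IncongruentSample:
    audio_path: str      # speech rendered with one emotion
    transcript: str      # text whose semantics convey a different emotion
    text_emotion: str    # emotion implied by the words, e.g. "happy"
    audio_emotion: str   # emotion expressed by the voice, e.g. "angry"

def score_modality_reliance(
    samples: Iterable[IncongruentSample],
    predict_emotion: Callable[[str, str], str],
) -> dict:
    """Count how often predictions agree with the text vs. the acoustic label."""
    text_hits = audio_hits = total = 0
    for s in samples:
        pred = predict_emotion(s.audio_path, s.transcript).lower()
        text_hits += pred == s.text_emotion
        audio_hits += pred == s.audio_emotion
        total += 1
    return {
        "text_agreement": text_hits / total,
        "audio_agreement": audio_hits / total,
        "n": total,
    }
```

Under this kind of scoring, a model that integrates acoustic cues should show high audio agreement, whereas a text-dominated model, as the paper reports for the evaluated SLMs, would score high on text agreement instead.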
Similar Papers
Dual Information Speech Language Models for Emotional Conversations
Computation and Language
Lets computers understand feelings in spoken words.
EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Audio and Speech Processing
Helps computers understand your feelings from your voice.