M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition
By: Jiajun He, Xiaohan Shi, Cheng-Hung Hu, and more
Potential Business Impact:
Helps computers better understand feelings from voices.
Multimodal speech emotion recognition (SER) has emerged as pivotal for improving human-machine interaction. Researchers increasingly leverage both speech and textual information obtained through automatic speech recognition (ASR) to recognize speakers' emotional states more comprehensively. Although this approach reduces reliance on human-annotated text, ASR errors can degrade emotion recognition performance. To address this challenge, our previous work introduced two auxiliary tasks, ASR error detection and ASR error correction, and proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across modalities. Building on this foundation, this paper introduces two additional training strategies. First, we propose an adversarial network to enhance the diversity of modality-specific representations. Second, we introduce a label-based contrastive learning strategy to better capture emotional features. We refer to the proposed method as M4SER and validate its superiority over state-of-the-art methods through extensive experiments on the IEMOCAP and MELD datasets.
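The abstract does not spell out the label-based contrastive objective, so the following is only a minimal sketch of a standard supervised (label-based) contrastive loss, assuming PyTorch, a batch of fused utterance embeddings, and integer emotion labels; the function name supervised_contrastive_loss and the temperature parameter are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull together embeddings that share an emotion label and push apart
    the rest. `embeddings` is (batch, dim); `labels` is (batch,) of class ids.
    """
    z = F.normalize(embeddings, dim=1)             # unit-length vectors
    sim = z @ z.T / temperature                    # (batch, batch) similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)               # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)  # avoid division by zero
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()


# Example: 8 fused utterance embeddings with 4 emotion classes.
embeddings = torch.randn(8, 256)
labels = torch.randint(0, 4, (8,))
print(supervised_contrastive_loss(embeddings, labels))
```

In a loss of this form, utterances with the same emotion label act as positives for one another, which is one common way to realize the "better capture emotional features" goal described above; the paper's exact formulation may differ.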
Similar Papers
MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition
Sound
Helps computers better understand feelings in voices.
Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
Sound
Helps computers understand how people feel from their voice.
Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction
Machine Learning (CS)
Helps computers understand emotions in people's voices.