Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

Published: December 4, 2025 | arXiv ID: 2512.04551v1

By: Cong Wang, Yizhong Geng, Yuhua Wen and more

Potential Business Impact:

Helps computers understand how people feel from their voice.

Business Areas:

Speech Recognition, Data and Analytics, Software

Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging because emotions are complex and annotated data is scarce. To tackle these challenges, we propose a multi-loss learning (MLL) framework that integrates an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method uses SNR-based augmentation to generate diverse speech samples that capture subtle emotional variations. FLAM enhances frame-level feature extraction to capture emotional cues that span multiple frames. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive losses to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results show that our method achieves state-of-the-art performance, which suggests that it is effective and robust.
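
To make the ideas in the abstract more concrete, here is a minimal Python sketch of what SNR-based mixing and a weighted sum of two of the named losses could look like. It is not the authors' implementation: the abstract does not give the EAM formula or the loss weights, so every function name here (`mix_at_snr`, `soft_label`, `combined_loss`), the rule that derives the mixing weight from the energy ratio, and the weights `w_kl` and `w_focal` are assumptions made for illustration. The center and supervised contrastive losses are left out because they operate on batches of embeddings.

```python
import math


def signal_energy(x):
    """Mean squared amplitude of a waveform given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(primary, secondary, snr_db):
    """Add `secondary` to `primary` at a chosen primary-to-secondary SNR (dB).

    Returns the mixed waveform and `lam`, the share of the total energy
    contributed by `primary`. Using `lam` as the label weight is an
    assumption of this sketch, not a detail taken from the paper.
    """
    e_p = signal_energy(primary)
    e_s = signal_energy(secondary)
    if e_p == 0.0 or e_s == 0.0:
        raise ValueError("both signals need non-zero energy")
    # Energy the scaled secondary signal must have to hit the requested SNR.
    target = e_p / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target / e_s)
    mixed = [p + gain * s for p, s in zip(primary, secondary)]
    lam = e_p / (e_p + target)
    return mixed, lam


def soft_label(lam, class_a, class_b, num_classes):
    """Mixup-style soft label: weight lam on class_a, 1 - lam on class_b."""
    label = [0.0] * num_classes
    label[class_a] += lam
    label[class_b] += 1.0 - lam
    return label


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, pred)
        if t > 0.0
    )


def focal_loss(pred, true_class, gamma=2.0, eps=1e-12):
    """Focal loss for one example; gamma = 0 reduces it to cross-entropy."""
    p_t = max(pred[true_class], eps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def combined_loss(pred, target, true_class, w_kl=1.0, w_focal=1.0, gamma=2.0):
    """Weighted sum of the two losses above (the weights are hypothetical)."""
    return (
        w_kl * kl_divergence(target, pred)
        + w_focal * focal_loss(pred, true_class, gamma)
    )
```

Under these assumptions, mixing two equal-energy clips at 0 dB gives `lam = 0.5`, so the soft label splits evenly between the two emotion classes. A higher SNR pushes `lam` toward 1 and the label toward the primary clip's class.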

Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion

Sound

Helps computers understand feelings in voices better.

25 Aug 2025

90%

M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition

Human-Computer Interaction

Helps computers understand feelings from voices better.

23 Sep 2025

89%

Amplifying Emotional Signals: Data-Efficient Deep Learning for Robust Speech Emotion Recognition

Audio and Speech Processing

Helps computers understand your feelings from your voice.

26 Aug 2025

Page Count

5 pages
