Enhancing Speech Emotion Recognition with Multi-Task Learning and Dynamic Feature Fusion
By: Honghong Wang, Jing Deng, Fanqin Meng, and more
Potential Business Impact:
Helps computers understand feelings in voices better.
This study investigates fine-tuning self-supervised learning (SSL) models using multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework simultaneously handles four related tasks: emotion recognition, gender recognition, speaker verification, and automatic speech recognition. An innovative co-attention module is introduced to dynamically capture the interactions between features from the primary emotion classification task and the auxiliary tasks, enabling context-aware fusion. Moreover, we introduce the Sample Weighted Focal Contrastive (SWFC) loss function to address class imbalance and semantic confusion by adjusting sample weights for difficult and minority samples. The method is validated on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge, showing significant performance improvements.
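To make the two core ideas concrete, below is a minimal PyTorch sketch of (a) a co-attention fusion step where emotion features attend over auxiliary-task features, and (b) a focal, sample-weighted supervised contrastive loss in the spirit of SWFC. The abstract does not give the exact architecture or loss formulation, so every module name, equation, and hyperparameter here (e.g., `CoAttentionFusion`, `swfc_loss`, `gamma`, `temperature`) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionFusion(nn.Module):
    """Hypothetical co-attention fusion: the emotion feature queries the
    stacked auxiliary-task features (gender, speaker, ASR) so the fused
    representation is context-aware. Assumed design, not from the paper."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, emotion_feat, aux_feats):
        # emotion_feat: (B, 1, D); aux_feats: (B, T, D) stacked auxiliary features
        fused, _ = self.attn(query=emotion_feat, key=aux_feats, value=aux_feats)
        return self.norm(emotion_feat + fused).squeeze(1)  # (B, D)

def swfc_loss(embeddings, labels, class_weights, gamma=2.0, temperature=0.1):
    """Illustrative Sample-Weighted Focal Contrastive loss: a supervised
    contrastive loss with focal down-weighting of easy anchors and
    per-class weights for minority classes (assumed form)."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                        # (B, B) similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))
    self_mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = pos_mask & self_mask                      # positives, excluding self

    # log-softmax over all other samples in the batch
    log_prob = sim - torch.logsumexp(sim.masked_fill(~self_mask, -1e9),
                                     dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    mean_log_prob = (log_prob * pos_mask).sum(1) / pos_counts

    p = mean_log_prob.exp()              # "easiness" of each anchor, in (0, 1]
    focal = (1.0 - p).pow(gamma)         # emphasize hard / confusable samples
    w = class_weights[labels] * focal    # minority-class and difficulty weights
    return -(w * mean_log_prob)[pos_mask.any(1)].mean()
```

Under this reading, the focal term plays the same role as in focal loss (suppressing easy samples), while `class_weights` upweights minority emotions, which together target the class imbalance and semantic confusion the abstract describes.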
Similar Papers
Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
Sound
Helps computers understand how people feel from their voice.
M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition
Human-Computer Interaction
Helps computers understand feelings from voices better.
MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition
Sound
Helps computers understand feelings in voices better.