Bimodal Connection Attention Fusion for Speech Emotion Recognition
By: Jiachen Luo, Huy Phan, Lin Wang, and more
Potential Business Impact:
Helps computers understand feelings from voices and words.
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
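The abstract names the three modules but gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together. Every layer choice, dimension, gating mechanism, and the pooling/classification head below is an assumption made for illustration, not the authors' architecture; only the three-module structure and the audio-text inputs come from the abstract.

```python
# Hypothetical sketch of a BCAF-style pipeline. Module internals are assumed;
# the abstract specifies only the three-module decomposition.
import torch
import torch.nn as nn


class InteractiveConnectionNetwork(nn.Module):
    """Encoder-decoder modeling audio-text connections (Transformer assumed)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=1)
        self.decoder = nn.TransformerDecoder(dec, num_layers=1)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Encode one modality, decode conditioned on the other, so the output
        # mixes modality-specific features with cross-modal connections.
        memory = self.encoder(audio)
        return self.decoder(text, memory)


class BimodalAttentionNetwork(nn.Module):
    """Intra- and inter-modal attention for semantic complementation."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        intra, _ = self.intra(x, x, x)          # within-modality interactions
        inter, _ = self.inter(x, other, other)  # cross-modality interactions
        return intra + inter


class CorrelativeAttentionNetwork(nn.Module):
    """Gated fusion intended to capture correlations and damp cross-modal noise."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, a: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([a, t], dim=-1))  # correlation-based gate
        return g * a + (1 - g) * t


class BCAF(nn.Module):
    """End-to-end sketch: connect, attend, correlate, then classify emotions."""

    def __init__(self, d_model: int = 256, n_classes: int = 7):
        super().__init__()
        self.connect = InteractiveConnectionNetwork(d_model)
        self.attend_a = BimodalAttentionNetwork(d_model)
        self.attend_t = BimodalAttentionNetwork(d_model)
        self.correlate = CorrelativeAttentionNetwork(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        connected = self.connect(audio, text)
        a = self.attend_a(audio, connected).mean(dim=1)  # pool audio stream
        t = self.attend_t(text, connected).mean(dim=1)   # pool text stream
        return self.classifier(self.correlate(a, t))


if __name__ == "__main__":
    model = BCAF()
    audio = torch.randn(2, 50, 256)  # (batch, audio frames, features)
    text = torch.randn(2, 20, 256)   # (batch, tokens, features)
    print(model(audio, text).shape)  # torch.Size([2, 7])
```

The sigmoid gate is one plausible reading of "reduces cross-modal noise": it learns, per feature, how much to trust each modality rather than summing them blindly. The 7-class head matches MELD's emotion labels; IEMOCAP setups typically use fewer classes.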
Similar Papers
Heterogeneous bimodal attention fusion for speech emotion recognition
Sound
Helps computers understand feelings from talking and sounds.
Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture
Computation and Language
Helps computers understand feelings from talking, seeing, and hearing.
Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition
Computer Vision and Pattern Recognition
Helps computers understand emotions from faces and voices.