MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
By: Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, and more
Potential Business Impact:
Helps computers understand emotions from faces, voices, and words.
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal independently and thus overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotion in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline's CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world settings. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW
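To make the abstract's three named ingredients concrete, here is a minimal PyTorch sketch of bi-directional cross-modal attention between two modality streams, a head that predicts emotion as polar coordinates (radius and angle on Russell's circumplex) before converting to valence-arousal, and the CCC metric. This is not the authors' released code: the dimensions, module names, residual-sum fusion, and the sigmoid/tanh parameterization of radius and angle are illustrative assumptions.

```python
# Hedged sketch of the mechanisms named in the MAVEN abstract.
# All sizes and design details here are assumptions, not the paper's code.
import math
import torch
import torch.nn as nn


class BiDirectionalCrossModalAttention(nn.Module):
    """Each modality attends to the other; residual sum + LayerNorm per stream."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, seq_len, dim) from modality-specific encoders
        attn_a, _ = self.a_to_b(feat_a, feat_b, feat_b)  # stream A queries B
        attn_b, _ = self.b_to_a(feat_b, feat_a, feat_a)  # stream B queries A
        return self.norm_a(feat_a + attn_a), self.norm_b(feat_b + attn_b)


class PolarEmotionHead(nn.Module):
    """Predicts (radius, angle), then maps to valence-arousal in [-1, 1]."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, fused):
        r_raw, theta_raw = self.proj(fused).unbind(-1)
        radius = torch.sigmoid(r_raw)             # emotion intensity in [0, 1]
        theta = math.pi * torch.tanh(theta_raw)   # angle in (-pi, pi)
        valence = radius * torch.cos(theta)
        arousal = radius * torch.sin(theta)
        return torch.stack([valence, arousal], dim=-1)


def concordance_cc(pred, gold):
    """Concordance correlation coefficient, the paper's evaluation metric:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    var_sum = pred.var(unbiased=False) + gold.var(unbiased=False)
    return 2 * cov / (var_sum + (pred_mean - gold_mean) ** 2)


if __name__ == "__main__":
    vis = torch.randn(2, 16, 256)   # stand-in visual features
    aud = torch.randn(2, 16, 256)   # stand-in audio features
    xattn = BiDirectionalCrossModalAttention()
    head = PolarEmotionHead()
    v, a = xattn(vis, aud)
    va = head(v + a)                # fuse streams by summation, predict VA
    print(va.shape)                 # torch.Size([2, 16, 2])
```

A usage note on the polar head: predicting (radius, angle) couples valence and arousal through a shared intensity term, which is one plausible way to respect the correlation between the two dimensions that the abstract says independent prediction overlooks.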
Similar Papers
Interactive Multimodal Fusion with Temporal Modeling
CV and Pattern Recognition
Lets computers guess your feelings from faces and voices.
Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition
CV and Pattern Recognition
Helps computers understand emotions from faces and voices.
Mamba-VA: A Mamba-based Approach for Continuous Emotion Recognition in Valence-Arousal Space
CV and Pattern Recognition
Reads emotions from videos to help computers understand feelings.