Semantic Matters: Multimodal Features for Affective Analysis
By: Tobias Hallmen, Robin-Nico Kampa, Fabian Deuser, and more
Potential Business Impact:
Helps computers understand emotions from voice and face.
In this study, we present our methodology for two tasks of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-Wild: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge. We use a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract audio features that capture both linguistic and paralinguistic information. Our approach combines a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT text encoder, and a Vision Transformer (ViT); their predictions are subsequently processed by a long short-term memory (LSTM) architecture or a convolution-like method for temporal modeling. We integrate the textual and visual modalities into our analysis because semantic content provides valuable contextual cues, and the meaning of speech often conveys more critical insight than its acoustics alone. Fusing in the visual modality in some cases helps to interpret the textual modality more precisely. This combined approach yields significant performance improvements, achieving $\rho_{\text{TEST}} = 0.706$ on EMI and $F1_{\text{TEST}} = 0.702$ on BAH, securing first place in the EMI challenge and second place in the BAH challenge.
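To make the described pipeline concrete, here is a minimal sketch of the late-fusion idea: pretrained unimodal encoders feeding a bidirectional LSTM for temporal modeling. The HuggingFace checkpoint names, hidden sizes, output count, and the broadcast-and-concatenate alignment are illustrative assumptions, not the authors' exact setup; in particular, the paper uses a podcast-pretrained Wav2Vec 2.0 and an additional VAD module that this sketch omits.

```python
# Sketch of audio/text/vision late fusion with LSTM temporal modeling.
# Checkpoints and dimensions are placeholder assumptions, not the
# paper's actual weights or architecture details.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel, ViTModel

class MultimodalLSTMFusion(nn.Module):
    def __init__(self, hidden=256, num_outputs=6):  # e.g. one intensity per target emotion
        super().__init__()
        # Pretrained unimodal encoders (public checkpoints stand in for
        # the authors' podcast-pretrained Wav2Vec 2.0).
        self.audio = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        fused_dim = (self.audio.config.hidden_size
                     + self.text.config.hidden_size
                     + self.vision.config.hidden_size)
        # Temporal modeling over the fused per-step features.
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_outputs)

    def forward(self, waveform, token_ids, attention_mask, frames):
        # Audio yields a temporal sequence; text and vision are pooled
        # to clip-level vectors and broadcast along the audio time axis
        # (a simplification; the paper's alignment may differ).
        a = self.audio(waveform).last_hidden_state          # (B, T, Da)
        t = self.text(input_ids=token_ids,
                      attention_mask=attention_mask).last_hidden_state.mean(1)
        v = self.vision(pixel_values=frames).last_hidden_state.mean(1)
        T = a.size(1)
        fused = torch.cat([a,
                           t.unsqueeze(1).expand(-1, T, -1),
                           v.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        out, _ = self.lstm(fused)
        # Pool over time for a clip-level prediction.
        return self.head(out.mean(dim=1))
```

Broadcasting the pooled text and vision vectors along the audio time axis is one simple way to fuse modalities with different native rates; the paper's convolution-like alternative would replace the LSTM as the temporal model.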
Similar Papers
Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition
CV and Pattern Recognition
Helps computers understand emotions from faces and voices.
Interactive Multimodal Fusion with Temporal Modeling
CV and Pattern Recognition
Lets computers guess your feelings from faces and voices.
HSEmotion Team at ABAW-8 Competition: Audiovisual Ambivalence/Hesitancy, Emotional Mimicry Intensity and Facial Expression Recognition
CV and Pattern Recognition
Helps computers understand emotions from faces, voices, and words.