Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
By: Hsuan-Yu Wang, Pei-Ying Lee, Berlin Chen
Potential Business Impact:
Helps computers better understand emotions in spoken conversation.
In this paper, we investigate the impact of timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline that uses pre-trained ASR and speaker diarization models, systematically synchronizing their timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. These results underscore the critical importance of temporal alignment and provide a foundation for robust multimodal emotion analysis.
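The alignment step lends itself to a short illustration. The abstract does not spell out the exact synchronization rule, so the sketch below uses a common maximal-overlap heuristic: each ASR word (with start and end timestamps) is assigned to the diarization turn it overlaps most, yielding speaker-attributed transcript segments. The `Segment` dataclass and function names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str    # transcript word (ASR) or speaker id (diarization)

def overlap(a: Segment, b: Segment) -> float:
    """Temporal overlap in seconds between two segments (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def align(asr_words: list[Segment], diar_turns: list[Segment]) -> list[tuple[str, str]]:
    """Assign each ASR word to the diarization turn with maximal temporal
    overlap, producing (speaker, word) pairs for a speaker-labeled transcript."""
    aligned = []
    for word in asr_words:
        best = max(diar_turns, key=lambda turn: overlap(word, turn), default=None)
        if best is not None and overlap(word, best) > 0.0:
            aligned.append((best.label, word.label))
    return aligned
```

Words whose timestamps fall entirely outside any diarization turn are dropped here; a production pipeline would more likely snap them to the nearest turn boundary.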
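For the fusion stage, a second minimal PyTorch sketch shows cross-attention combined with a gating mechanism, assuming 768-dimensional RoBERTa and Wav2Vec features and the four-class IEMOCAP setup common in SER work. The layer choices, dimensions, and which modality serves as the attention query are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Text tokens attend over audio frames; a sigmoid gate decides how much
    attended audio is mixed into the text representation before classification."""

    def __init__(self, dim: int = 768, n_heads: int = 8, n_classes: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)   # gate computed from [text; attended audio]
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T_text, dim) from RoBERTa; audio_emb: (B, T_audio, dim) from Wav2Vec
        attended, _ = self.cross_attn(query=text_emb, key=audio_emb, value=audio_emb)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, attended], dim=-1)))
        fused = g * attended + (1.0 - g) * text_emb   # gated residual mix of the modalities
        return self.classifier(fused.mean(dim=1))     # mean-pool to utterance-level logits
```

Using text as the query and audio as key/value is one plausible direction; the reverse (or a symmetric bidirectional variant) is equally consistent with the abstract's description.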
Similar Papers
Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model
Audio and Speech Processing
Helps computers understand how people feel when they talk.
Mobile-Efficient Speech Emotion Recognition Using DistilHuBERT: A Cross-Corpus Validation Study
Sound
Makes phones understand your feelings from your voice.
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
Audio and Speech Processing
Makes computers understand spoken words faster.