Score: 0

EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition

Published: August 6, 2025 | arXiv ID: 2508.06321v1

By: Durjoy Chandra Paul, Gaurob Saha, Md Amjad Hossain

Potential Business Impact:

Helps computers understand how you feel when you talk.

Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework, that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features that are taken from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used to combine both traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral Coefficient (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78\% and unweighted accuracy of 92.52\% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75\% and unweighted accuracy of 91.28\%. On the RAVDESS dataset, we get a weighted accuracy of 94.53\% and 94.98\% unweighted accuracy for ReLU activation and 93.72\% weighted accuracy and 94.64\% unweighted accuracy for ELU activation. These results highlight EmoAugNet's effectiveness in improving the robustness and performance of SER systems through integated data augmentation and hybrid modeling.

Country of Origin
🇺🇸 United States

Page Count
5 pages

Category
Computer Science:
Sound