Regularizing Learnable Feature Extraction for Automatic Speech Recognition
By: Peter Vieting, Maximilian Kannen, Benedikt Hilmes, and more
Potential Business Impact:
Makes talking computers understand speech better.
Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems, since they can be trained directly to fit the acoustic model. However, their performance often falls short of classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short-time Fourier transform (STFT) domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
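To illustrate the core idea, here is a minimal sketch of SpecAugment-style masking applied in the STFT domain rather than on downstream features, so a learnable front-end receives the masked signal representation. The helper names, mask counts, and widths are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(x, n_fft=256, hop=64):
    """Complex STFT, shape [n_freq, n_frames] (illustrative helper)."""
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + n_fft] * win) for s in starts], axis=1)

def stft_mask(spec, n_freq_masks=2, max_freq_width=8,
              n_time_masks=2, max_time_width=10):
    """SpecAugment-style zero masks applied directly to the STFT."""
    spec = spec.copy()
    n_freq, n_frames = spec.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, n_freq - w + 1))
        spec[f0:f0 + w, :] = 0.0  # mask a band of frequency bins
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_time_width + 1))
        t0 = int(rng.integers(0, n_frames - w + 1))
        spec[:, t0:t0 + w] = 0.0  # mask a span of time frames
    return spec
```

In this sketch, the masked complex spectrogram (or a waveform reconstructed from it) would then be fed to the learnable feature extractor, in contrast to standard SpecAugment, which masks fixed features such as log-mel spectrograms after extraction.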
Similar Papers
Unified Learnable 2D Convolutional Feature Extraction for ASR
Audio and Speech Processing
Makes speech recognition work better with less power.
Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments
Audio and Speech Processing
Makes talking computers understand voices better.
Regularized Federated Learning for Privacy-Preserving Dysarthric and Elderly Speech Recognition
Audio and Speech Processing
Helps computers understand speech from sick or old people.