An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR
By: Sewade Ogun, Vincent Colotte, Emmanuel Vincent
Potential Business Impact:
Makes voice assistants understand you better.
Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech compared to training on real data only.
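The abstract reports gains as *relative* WER reductions. A minimal sketch of that arithmetic, using hypothetical WER values chosen only to illustrate how an 11% relative figure arises (the paper does not state these absolute numbers):

```python
def relative_wer_reduction(wer_baseline: float, wer_augmented: float) -> float:
    """Relative WER reduction of an augmented model over a real-data-only baseline."""
    return (wer_baseline - wer_augmented) / wer_baseline

# Hypothetical illustration: a baseline WER of 10.0% dropping to 8.9%
# after augmentation is an 11% relative reduction.
print(round(relative_wer_reduction(10.0, 8.9), 2))
```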
Similar Papers
Frustratingly Easy Data Augmentation for Low-Resource ASR
Computation and Language
Makes talking computers understand rare languages better.
O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Sound
Changes your voice to sound like anyone.
Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification
Sound
Makes voice checks better with fake voices.