Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
By: Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque and more
Potential Business Impact:
Helps computers understand different accents of a language.
Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages, while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR, we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect-specific model training alleviates the issue. Our dataset also serves as an out-of-distribution (OOD) resource for ASR modeling under constrained resources. The dataset and code developed for this project are publicly available.
Similar Papers
RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
Computation and Language
Helps computers understand different Bengali accents.
A Unified Denoising and Adaptation Framework for Self-Supervised Bengali Dialectal ASR
Sound
Helps computers understand Bengali speech, even with noise.
Benchmarking Automatic Speech Recognition Models for African Languages
Computation and Language
Helps computers understand many African languages.