Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
By: Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly and more
Potential Business Impact:
Helps computers understand spoken Bambara language better.
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces word error rate (WER) from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
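For readers unfamiliar with the metric, the sketch below shows how WER is conventionally computed (word-level edit distance divided by the number of reference words) and how the absolute WER figures quoted above translate into relative reductions. It is a minimal illustration, not the authors' evaluation code: the function names `wer` and `relative_reduction` are hypothetical, tokenization is assumed to be plain whitespace splitting, and the paper's transcript normalization is not reproduced here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace; no normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Placeholder tokens, not real Bambara: one substitution plus one deletion.
    print(f"toy WER: {wer('a b c d', 'a x c'):.2f}")
    # WER pairs (in percent) as reported in the abstract.
    for before, after in [(44.47, 37.12), (36.07, 32.33)]:
        print(f"{before}% -> {after}%: "
              f"{100 * relative_reduction(before, after):.1f}% relative reduction")
```

Under these assumptions, the reported improvements correspond to relative WER reductions of roughly 16.5% and 10.4% on the two test sets.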
Similar Papers
Dealing with the Hard Facts of Low-Resource African NLP
Computation and Language
Helps computers understand a rare language.
Benchmarking Automatic Speech Recognition Models for African Languages
Computation and Language
Helps computers understand many African languages.
How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu
Computation and Language
Helps computers understand rare languages with less data.