Dealing with the Hard Facts of Low-Resource African NLP
By: Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté and more
Potential Business Impact:
Helps computers understand a rare language.
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.