Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach
By: Kezia Oketch, John P. Lalor, Ahmed Abbasi
Potential Business Impact:
Helps language technology better understand Swahili as it is actually spoken.
We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in the coverage of sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibit tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
Similar Papers
Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets
Computation and Language
AI understands Swahili better when trained in Swahili.
Dealing with the Hard Facts of Low-Resource African NLP
Computation and Language
Helps computers understand a rare language.
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Computation and Language
Helps computers understand many African languages.