AdiBhashaa: A Community-Curated Benchmark for Machine Translation into Indian Tribal Languages
By: Pooja Singh, Sandeep Kumar
Potential Business Impact:
Helps tribal languages speak to computers.
Large language models and multilingual machine translation (MT) systems increasingly drive access to information, yet many languages of tribal communities remain effectively invisible in these technologies. This invisibility exacerbates existing structural inequities in education, governance, and digital participation. We present AdiBhashaa, a community-driven initiative that constructs the first open parallel corpora and baseline MT systems for four major Indian tribal languages: Bhili, Mundari, Gondi, and Santali. This work combines participatory data creation with native speakers, human-in-the-loop validation, and systematic evaluation of both encoder-decoder MT models and large language models. In addition to reporting technical findings, we articulate how AdiBhashaa illustrates a possible model for more equitable AI research: it centers local expertise, builds capacity among early-career researchers from marginalized communities, and foregrounds human validation in the development of language technologies.
Similar Papers
BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Computation and Language
Tests AI on Indian knowledge in English and Hindi.
Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation
Hardware Architecture
Translates many languages on small devices.