Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing
By: Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, and more
Potential Business Impact:
Helps translate rare languages using computer-generated training data.
Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani–Spanish and Quechua–Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
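The pipeline the abstract describes (synthesize Spanish-to-indigenous pairs with a large multilingual model, filter noisy output, then fine-tune mBART and score with chrF++) can be sketched in a few lines. The sketch below is illustrative only and rests on assumptions the abstract does not state: NLLB-200 stands in for the unnamed high-capacity generator model, sacrebleu's CHRF metric with word_order=2 stands in for chrF++, the length-ratio filter is a common heuristic standing in for the paper's unspecified noise-aware filtering, and names like synthesize_pairs and filter_noisy are hypothetical.

```python
# Illustrative sketch only; model choice, function names, and the filtering
# heuristic are assumptions, not the paper's exact method.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sacrebleu.metrics import CHRF

MODEL_NAME = "facebook/nllb-200-distilled-600M"  # assumed stand-in generator

# NLLB tokenizers take the source language code at load time.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def synthesize_pairs(spanish_sentences, tgt_lang="grn_Latn"):
    """Translate monolingual Spanish into the target indigenous language
    (Guarani here) to create synthetic (source, target) training pairs."""
    batch = tokenizer(spanish_sentences, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch,
        # Force decoding to start with the target-language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=128,
    )
    targets = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return list(zip(spanish_sentences, targets))

def filter_noisy(pairs, min_tokens=3, max_ratio=3.0):
    """Noise-aware filtering (assumed heuristic): drop pairs that are very
    short or whose token-length ratio suggests misalignment or degeneration."""
    kept = []
    for src, tgt in pairs:
        s_len, t_len = len(src.split()), len(tgt.split())
        if min(s_len, t_len) < min_tokens:
            continue
        if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

# chrF++ is chrF (character n-grams) extended with word n-grams up to
# order 2, which is what sacrebleu's word_order=2 setting computes.
chrf_pp = CHRF(word_order=2)

def evaluate(hypotheses, references):
    """Corpus-level chrF++ of system output against one reference set."""
    return chrf_pp.corpus_score(hypotheses, [references]).score

if __name__ == "__main__":
    pairs = filter_noisy(synthesize_pairs(["El río está muy crecido hoy."]))
    print(pairs)
```

Fine-tuning mBART on the filtered pairs (for example, facebook/mbart-large-50 via the transformers Trainer) would complete the loop described in the abstract; that step is omitted here for brevity.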
Similar Papers
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
Computation and Language
Creates better AI for languages with less data.
The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
Computation and Language
Helps computers understand many Indian languages better.