Scaling Low-Resource MT via Synthetic Data Generation with LLMs
By: Ona de Gibert, Joseph Attieh, Teemu Vahtola, and more
Potential Business Impact:
Helps computers translate rare languages better.
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its overall high quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, (iii) studying the effect of varying training data size, and (iv) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
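The abstract does not spell out the pivoting procedure, but the core idea is standard: since every synthetic document has an English source and an LLM translation in each target language, any two non-English translations of the same document can be paired through the shared English pivot. The sketch below illustrates this pairing step only; the corpus layout, language codes, and the `pivot_pairs` helper are illustrative assumptions, not the authors' actual pipeline.

```python
from itertools import permutations

# Illustrative synthetic corpus: one entry per source document,
# mapping a language code to that document's (LLM-generated) text.
# The "en" entry is the original Europarl source.
synthetic_corpus = [
    {
        "en": "The committee approved the proposal.",
        "fi": "Komitea hyväksyi ehdotuksen.",
        "mt": "Il-kumitat approva l-proposta.",
    },
    # ... one dict per document
]

def pivot_pairs(corpus):
    """Derive extra language pairs by pairing the non-English
    translations of each document, using English as the pivot."""
    pairs = {}
    for doc in corpus:
        targets = [lang for lang in doc if lang != "en"]
        # Every ordered pair of target languages yields a new
        # parallel segment for that language direction.
        for src, tgt in permutations(targets, 2):
            pairs.setdefault((src, tgt), []).append((doc[src], doc[tgt]))
    return pairs

extended = pivot_pairs(synthetic_corpus)
for (src, tgt), segments in sorted(extended.items()):
    print(f"{src} -> {tgt}: {len(segments)} segment pairs")
```

Because the pairing is purely combinatorial, a handful of target languages fans out into many directed pairs; noise in any one LLM translation propagates to every pair it participates in, which is consistent with the paper's observation that the data helps even when noisy.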
Similar Papers
A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
Computation and Language
Makes small AI learn languages better with smart text.
Scaling Laws of Synthetic Data for Language Models
Computation and Language
Creates endless smart computer learning material.
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
Computation and Language
Helps computers translate rare languages better.