Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text
By: Dan John Velasco, Matthew Theodore Roque
Potential Business Impact:
Helps computers learn many languages with less text.
Most languages lack sufficient data for large-scale monolingual pretraining, creating a "data wall." Multilingual pretraining helps but is limited by language imbalance and the "curse of multilinguality." An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil, two typologically distant, lower-resource languages, and pretraining GPT-2 models (124M to 774M parameters) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
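The study design in the abstract can be read as a three-stage pipeline: translate English text into the target language with MT, pretrain a GPT-2-style model from scratch on the MT-derived corpus, then continue pretraining (adapt) on a small native corpus and evaluate cross-entropy on native text. The sketch below is a minimal illustration of that pipeline, assuming the Hugging Face transformers library, a Helsinki-NLP OPUS-MT English-to-Indonesian checkpoint, and GPT-2's English tokenizer; the checkpoint names, tiny corpora, and hyperparameters are illustrative assumptions, not the paper's actual setup (which would train a language-specific tokenizer and use full corpora and proper training loops).

```python
# Minimal sketch of the MT-pretraining pipeline described in the abstract.
# All checkpoint names and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel, pipeline

# 1) Translate high-resource English text into the target language (here: Indonesian).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")  # assumed MT checkpoint
english_corpus = ["The weather is nice today.", "She reads a book every evening."]
mt_corpus = [out["translation_text"] for out in translator(english_corpus)]

# 2) Initialize a GPT-2-style model from scratch (roughly the 124M base configuration).
# Reusing GPT-2's English BPE tokenizer is a simplification; the real study would
# train a tokenizer on the target language.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))

def lm_loss(texts):
    """Cross-entropy language-modeling loss on a batch of raw text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    return model(**batch, labels=labels).loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train_step(texts):
    optimizer.zero_grad()
    loss = lm_loss(texts)
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: pretraining on MT-derived text.
print("MT-pretraining loss:", train_step(mt_corpus))

# Stage 2: continued pretraining ("adaptation") on a small native corpus.
native_corpus = ["Cuaca hari ini cerah.", "Dia membaca buku setiap malam."]
print("Adaptation loss:", train_step(native_corpus))

# 3) Evaluation: cross-entropy on held-out native text, the paper's main metric.
model.eval()
with torch.no_grad():
    print("Eval loss on native text:", lm_loss(native_corpus).item())
```

In the full study, the same recipe would be repeated across model sizes (124M to 774M parameters), with and without LLM-simplified English on the source side, and compared against models pretrained only on native text.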
Similar Papers
Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
Computation and Language
Helps computers translate rare languages better.
Domain-Adaptive Continued Pre-Training of Small Language Models
Computation and Language
Makes small AI smarter with less computer power.
Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation
Computation and Language
Teaches computers new languages without English.