Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text
By: Dan John Velasco, Matthew Theodore Roque
Potential Business Impact:
Helps computers learn many languages with less text.
Most languages lack sufficient data for large-scale monolingual pretraining, creating a "data wall." Multilingual pretraining helps but is limited by language imbalance and the "curse of multilinguality." An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil, two typologically distant, lower-resource languages, and pretraining GPT-2 models (124M to 774M parameters) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
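The study design in the abstract can be read as a three-stage pipeline: translate English text into the target language with MT, pretrain a GPT-2-style model from scratch on the MT-derived corpus, then continue pretraining (adapt) on a small native corpus and evaluate cross-entropy on native text. The sketch below is a minimal illustration of that pipeline, assuming the Hugging Face transformers library, a Helsinki-NLP OPUS-MT English-to-Indonesian checkpoint, and GPT-2's English tokenizer; the checkpoint names, tiny corpora, and hyperparameters are illustrative assumptions, not the paper's actual setup (which would train a language-specific tokenizer and use full corpora and proper training loops).

```python
# Minimal sketch of the MT-pretraining pipeline described in the abstract.
# All checkpoint names and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel, pipeline

# 1) Translate high-resource English text into the target language (here: Indonesian).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-id")  # assumed MT checkpoint
english_corpus = ["The weather is nice today.", "She reads a book every evening."]
mt_corpus = [out["translation_text"] for out in translator(english_corpus)]

# 2) Initialize a GPT-2-style model from scratch (roughly the 124M base configuration).
# Reusing GPT-2's English BPE tokenizer is a simplification; the real study would
# train a tokenizer on the target language.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))

def lm_loss(texts):
    """Cross-entropy language-modeling loss on a batch of raw text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    return model(**batch, labels=labels).loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train_step(texts):
    optimizer.zero_grad()
    loss = lm_loss(texts)
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 1: pretraining on MT-derived text.
print("MT-pretraining loss:", train_step(mt_corpus))

# Stage 2: continued pretraining ("adaptation") on a small native corpus.
native_corpus = ["Cuaca hari ini cerah.", "Dia membaca buku setiap malam."]
print("Adaptation loss:", train_step(native_corpus))

# 3) Evaluation: cross-entropy on held-out native text, the paper's main metric.
model.eval()
with torch.no_grad():
    print("Eval loss on native text:", lm_loss(native_corpus).item())
```

In the full study, the same recipe would be repeated across model sizes (124M to 774M parameters), with and without LLM-simplified English on the source side, and compared against models pretrained only on native text.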
Similar Papers
Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
Computation and Language
Helps computers translate rare languages better.
Domain-Adaptive Continued Pre-Training of Small Language Models
Computation and Language
Makes small AI smarter with less computer power.
Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation
Computation and Language
Teaches computers new languages without English.