Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation
By: Kevin Glocker, Kätriin Kukk, Romina Oji, and more
Potential Business Impact:
Makes computers understand many languages better.
Achieving high-performing language models that cover medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
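To make the two ideas in the abstract concrete, here is a minimal sketch (not the authors' code) of (1) upscaling a pretrained decoder-only model by duplicating layers before continued pretraining on a target language, and (2) merging language-specific checkpoints by plain weight averaging. The model identifier, the layer index to duplicate from, and the uniform-averaging merge are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch under assumptions: a LLaMA-style model with model.model.layers as a
# ModuleList, and checkpoints that share an identical architecture for merging.
import copy
import torch
from transformers import AutoModelForCausalLM

def depth_upscale(model, duplicate_from: int):
    """Grow a decoder-only model by copying its top layers on top of itself."""
    layers = model.model.layers                      # LLaMA-style layer list
    for layer in [copy.deepcopy(l) for l in layers[duplicate_from:]]:
        layers.append(layer)
    model.config.num_hidden_layers = len(layers)
    return model

def average_merge(state_dicts):
    """Uniform weight averaging over checkpoints with identical parameter names."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Hypothetical usage:
# base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# grown = depth_upscale(base, duplicate_from=8)   # then continue pretraining on target-language data
# merged_sd = average_merge([m.state_dict() for m in language_specific_models])
```

The paper compares several merging methods and finds large differences between them, so uniform averaging above should be read only as the simplest baseline, not the best-performing approach.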
Similar Papers
SCALE: Upscaled Continual Learning of Large Language Models
Computation and Language
Makes AI learn new things without forgetting old ones.
Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text
Computation and Language
Helps computers learn many languages with less text.
Scaling Performance of Large Language Model Pretraining
Distributed, Parallel, and Cluster Computing
Teaches computers to learn faster with less power.