XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs
By: Iñaki Lacunza, José Javier Saiz, Alexander Shvets, and more
Potential Business Impact:
Helps computers understand many languages better.
Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights, either from scratch or within a continual pre-training (CPT) phase. We target six languages that span a variety of geographic and intra- and inter-language-family relations: English and Spanish (high-resource), Portuguese and Catalan (mid-resource), and Galician and Basque (low-resource). We experiment with Salamandra-2b, a promising model for these languages. Using the IberoBench framework for quantitative evaluation, we investigate the effects of substantial data repetition for the minor languages and of under-sampling for the dominant languages. Finally, we release a promising new IberianLLM-7B-Instruct model centered on Iberian languages and English, pretrained from scratch and further improved via CPT with the XDoGE weights.
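To make the reweighting idea concrete, the sketch below shows a DoGE-style exponentiated-gradient update of a sampling distribution over languages, where each language's weight grows in proportion to how well training on it is estimated to generalize to the full mix. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `update_language_weights`, the `alignment_scores` values, and the step size are all hypothetical placeholders.

```python
import numpy as np

# Hypothetical sketch of a DoGE-style multilingual reweighting step.
# In DoGE, per-domain scores come from gradient alignment measured on a
# small proxy model; here alignment_scores are placeholder numbers.
LANGUAGES = ["en", "es", "pt", "ca", "gl", "eu"]

def update_language_weights(weights, alignment_scores, step_size=0.1):
    """Exponentiated-gradient (mirror descent) update of the language
    sampling distribution: languages with higher estimated generalization
    benefit receive proportionally more weight."""
    logits = np.log(weights) + step_size * alignment_scores
    new_weights = np.exp(logits - logits.max())  # numerically stable softmax
    return new_weights / new_weights.sum()

# Start from a uniform distribution over the six target languages.
weights = np.full(len(LANGUAGES), 1.0 / len(LANGUAGES))

# Placeholder scores: in this toy example the low-resource languages
# (gl, eu) happen to score higher, so their sampling weight increases.
alignment_scores = np.array([0.2, 0.3, 0.5, 0.6, 0.9, 1.0])

for _ in range(10):
    weights = update_language_weights(weights, alignment_scores)

for lang, w in zip(LANGUAGES, weights):
    print(f"{lang}: {w:.3f}")
```

The resulting weights would then be used to rescale the pre-training corpus for the full-size model, either from scratch or during CPT, as described in the abstract.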
Similar Papers
Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Computation and Language
Creates smart computer vision for rare languages.
Revisiting Multilingual Data Mixtures in Language Model Pretraining
Computation and Language
Makes computers understand many languages better.
Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP
Computation and Language
Helps find human rights abuses in any language.