
BYOL: Bring Your Own Language Into LLMs

Published: January 15, 2026 | arXiv ID: 2601.10804v1

By: Syed Waqas Zamir, Wassim Hamidouche, Boulbaba Ben Amor and more

Potential Business Impact:

Helps computers understand and use more languages.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at https://github.com/microsoft/byol.
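The abstract mentions weight-space model merging but does not say which merging scheme is used. As a rough illustration only, the sketch below shows one common variant, plain linear interpolation between a base checkpoint and a language-adapted checkpoint. The function name `merge_weights`, the `alpha` coefficient, and the toy dictionaries of lists standing in for parameter tensors are all assumptions made for this example and are not taken from the paper or its codebase.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
# (Hypothetical representation; real checkpoints hold tensors.)
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, adapted: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter layouts.

    alpha = 0.0 returns the base weights, alpha = 1.0 returns the adapted
    weights. This is an assumed merging rule, not the paper's confirmed one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_vals in base.items():
        adapted_vals = adapted[name]
        if len(base_vals) != len(adapted_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_vals, adapted_vals)
        ]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [0.0, 2.0]}
    adapted = {"layer.weight": [4.0, 2.0]}
    print(merge_weights(base, adapted, alpha=0.25))
```

In a scheme like this, a smaller `alpha` keeps the merged model closer to the base checkpoint (and so to its English and multilingual behavior), while a larger `alpha` favors the language-specific checkpoint.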

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Computation and Language

Creates smart computer vision for rare languages.

12 Nov 2025

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

Computation and Language

Helps computers translate rare languages better.

2 Apr 2025

MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

CV and Pattern Recognition

Helps computers understand languages and cultures better.

7 Aug 2025



Page Count

42 pages
