Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
By: Eeham Khan, Firas Saidani, Owen Van Esbroeck, and more
Potential Business Impact:
Helps computers understand rare languages better.
Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks while updating under 1% of model parameters. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
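To make the approach concrete, below is a minimal sketch of LoRA-based continual pre-training with the HuggingFace PEFT and Transformers libraries. The base model name, corpus file, and hyperparameters are illustrative assumptions, not the authors' exact configuration; the point is that only small adapter matrices are trained, keeping the updated parameter count well under 1% of the model.

```python
# Sketch of parameter-efficient continual pre-training (CPT) with LoRA.
# Assumptions: base model, corpus file, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-3.2-1B"  # assumed base model, not the paper's
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach LoRA adapters to the attention projections; only these train.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # confirms a small trainable fraction

# Small dialect corpus (hypothetical local text file of Quebec French).
dataset = load_dataset("text", data_files={"train": "quebec_french_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quebec-french-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           logging_steps=50),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("quebec-french-lora")  # saves only the LoRA adapter weights
```

Because only the adapter weights are saved, the adapted dialect model can be distributed as a few megabytes on top of the original checkpoint, which is what makes this kind of low-resource dialect adaptation cost-effective.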
Similar Papers
Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Computation and Language
Teaches computers to understand Tibetan language better.
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
Computation and Language
Makes AI understand many languages better.
Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation
Computation and Language
Teaches computers new languages without English.