Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
By: Leiyu Pan, Bojian Xiong, Lei Yang, and more
Potential Business Impact:
Makes computers understand and write Tibetan better.
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre-training and post-training a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the model's Tibetan capabilities, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
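The abstract mentions a dedicated data cleaning and processing pipeline for Tibetan but does not detail its steps. Below is a minimal, hypothetical sketch of what one stage of such a pipeline could look like: a Tibetan-script ratio filter, a minimum-length check, and exact-duplicate removal. The threshold values and helper names are assumptions for illustration, not the authors' actual method.

```python
# Hypothetical sketch of one stage of a Tibetan corpus-cleaning pipeline.
# NOT the paper's pipeline: thresholds and steps are illustrative assumptions.
import hashlib
from typing import Iterable, Iterator

TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF  # Unicode block "Tibetan"


def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tib = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tib / len(chars)


def clean_corpus(docs: Iterable[str],
                 min_ratio: float = 0.7,   # assumed threshold, not from the paper
                 min_chars: int = 200) -> Iterator[str]:
    """Yield deduplicated documents that look like genuine Tibetan text."""
    seen: set[str] = set()
    for doc in docs:
        doc = doc.strip()
        # Drop short documents and documents not dominated by Tibetan script.
        if len(doc) < min_chars or tibetan_ratio(doc) < min_ratio:
            continue
        # Exact-duplicate removal via a content hash.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


# Example: filter a small in-memory list of raw documents.
raw_docs = ["བོད་ཡིག", "mostly English text", ""]
cleaned = list(clean_corpus(raw_docs, min_chars=5))
```

A real pipeline for web-scale Tibetan data would likely add further stages (encoding repair, boilerplate stripping, near-duplicate detection, quality scoring), but the script-ratio filter shown here captures the core idea of isolating Tibetan-dominant text from mixed-language sources.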
Similar Papers
Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Computation and Language
Teaches computers to understand the Tibetan language better.
TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Computation and Language
Helps computers understand and write the Tibetan language.
Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models
Computation and Language
Teaches computers to understand Cantonese better.