Score: 1

Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Published: July 12, 2025 | arXiv ID: 2507.09205v4

By: Leiyu Pan, Bojian Xiong, Lei Yang, and more

Potential Business Impact:

Enables language models to understand and generate Tibetan text more accurately.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, remains particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored to Tibetan. With the curated data, we perform continued pre-training and post-training of a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the model's Tibetan capabilities, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
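The continual pre-training step described in the abstract can be sketched with a minimal Hugging Face Transformers setup, assuming a causal-LM multilingual base model and a cleaned Tibetan corpus stored as JSONL text records. The model name, corpus path, and hyperparameters below are placeholders for illustration, not the paper's actual configuration or cleaning pipeline.

```python
# Illustrative sketch: continual pre-training of a multilingual causal LM on a
# curated Tibetan corpus. All names and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "your-multilingual-base-model"   # placeholder base checkpoint
CORPUS_PATH = "tibetan_corpus.jsonl"          # placeholder: cleaned Tibetan text

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Load the curated corpus (one {"text": ...} record per document).
raw = load_dataset("json", data_files=CORPUS_PATH, split="train")

def tokenize(batch):
    # Truncate long documents; packing into fixed-length blocks is a common
    # alternative for pre-training but is omitted here for brevity.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Causal-LM collator: labels are the input ids, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="tibetan-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,   # lower than from-scratch pre-training is typical
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

A relatively low learning rate is a common choice for continual pre-training, since it adapts the model to the new language while limiting forgetting of the base model's existing multilingual abilities.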

Repos / Data Links

Page Count
13 pages

Category
Computer Science:
Computation and Language