TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling

Published: March 24, 2025 | arXiv ID: 2503.18288v5

By: Cheng Huang, Fan Gao, Yutong Liu, and more

Potential Business Impact:

Helps computers understand and write the Tibetan language.

Business Areas:
Text Analytics, Data and Analytics, Software

The advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine.
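The abstract names a three-stage training pipeline (pretraining, supervised fine-tuning, preference optimization) without implementation details. Below is a minimal sketch of the first stage, continued pretraining on raw corpus text, assuming a Hugging Face transformers stack; the model name, data file path, and hyperparameters are illustrative placeholders, not the authors' actual configuration, and stages 2 and 3 are only outlined in comments.

```python
# Minimal sketch of stage 1 (continued pretraining) of a three-stage
# pipeline, assuming Hugging Face `transformers` + `datasets`.
# "base-multilingual-llm" and the JSONL path are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("base-multilingual-llm")
model = AutoModelForCausalLM.from_pretrained("base-multilingual-llm")

# Raw Tibetan text, one {"text": ...} record per line (hypothetical file).
raw = load_dataset("json", data_files="tib_stc_pretrain.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

raw = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Causal-LM collator copies input_ids into labels (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(output_dir="stage1-pretrain",
                         per_device_train_batch_size=4,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=raw,
        data_collator=collator).train()

# Stage 2 (supervised fine-tuning) would reuse the same Trainer pattern
# on prompt/response instruction pairs; stage 3 (preference optimization)
# could use a preference-learning trainer such as `trl`'s DPOTrainer.
```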

Country of Origin
🇨🇳 China

Repos / Data Links
https://github.com/Vicentvankor/sun-shine

Page Count
11 pages

Category
Computer Science:
Computation and Language