TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
By: Cheng Huang, Fan Gao, Yutong Liu, and more
Potential Business Impact:
Helps computers understand and write the Tibetan language.
The advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine.
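The three-stage pipeline described above (pretraining, then supervised fine-tuning, then preference optimization) follows a now-common adaptation recipe. Below is a minimal sketch of the latter two stages using the Hugging Face trl library; the base checkpoint, dataset files, and hyperparameters are illustrative placeholders and do not reflect the authors' actual configuration, which is not shown in the abstract.

```python
# Sketch of a three-stage adaptation pipeline: continued pretraining, SFT, and
# preference optimization (DPO). All names below are placeholders, not the
# authors' setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "your-base-checkpoint"  # placeholder: multilingual base model to adapt
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1 (not shown): continued causal-LM pretraining over the raw Tibetan
# corpus, e.g. with a standard transformers Trainer and a language-modeling
# data collator.

# Stage 2: supervised fine-tuning on instruction-response pairs
# (assumed JSONL with a "text" field containing formatted prompts).
sft_data = load_dataset("json", data_files="tibetan_sft.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-out", max_steps=1000),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 3: preference optimization on chosen/rejected response pairs
# (assumed JSONL with "prompt", "chosen", "rejected" fields).
pref_data = load_dataset("json", data_files="tibetan_prefs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="dpo-out", max_steps=500),
    train_dataset=pref_data,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
)
dpo_trainer.train()
```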
Similar Papers
TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models
Computation and Language
Helps computers understand and speak the Tibetan language.
Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Computation and Language
Makes computers understand and write Tibetan better.
CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
Computation and Language
Tests Chinese AI better with new data.