TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
By: Cheng Huang, Fan Gao, Yutong Liu, and more
Potential Business Impact:
Helps computers understand and write the Tibetan language.
The advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available at: https://github.com/Vicentvankor/sun-shine.
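The three-stage pipeline described above (pretraining, then supervised fine-tuning, then preference optimization) follows a now-common adaptation recipe. Below is a minimal sketch of the latter two stages using the Hugging Face trl library; the base checkpoint, dataset files, and hyperparameters are illustrative placeholders and do not reflect the authors' actual configuration, which is not shown in the abstract.

```python
# Sketch of a three-stage adaptation pipeline: continued pretraining, SFT, and
# preference optimization (DPO). All names below are placeholders, not the
# authors' setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "your-base-checkpoint"  # placeholder: multilingual base model to adapt
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1 (not shown): continued causal-LM pretraining over the raw Tibetan
# corpus, e.g. with a standard transformers Trainer and a language-modeling
# data collator.

# Stage 2: supervised fine-tuning on instruction-response pairs
# (assumed JSONL with a "text" field containing formatted prompts).
sft_data = load_dataset("json", data_files="tibetan_sft.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-out", max_steps=1000),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 3: preference optimization on chosen/rejected response pairs
# (assumed JSONL with "prompt", "chosen", "rejected" fields).
pref_data = load_dataset("json", data_files="tibetan_prefs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="dpo-out", max_steps=500),
    train_dataset=pref_data,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
)
dpo_trainer.train()
```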
Similar Papers
TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models
Computation and Language
Helps computers understand and speak the Tibetan language.
Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Computation and Language
Makes computers understand and write Tibetan better.
CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
Computation and Language
Tests Chinese AI better with new data.