Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language
By: Jiaxin Zuo , Yiquan Wang , Yuan Pan and more
Potential Business Impact:
Helps computers understand a rare language better.
To address a critical resource gap in Uyghur Natural Language Processing (NLP), this study introduces a dependency annotation framework designed to overcome the limitations of existing treebanks for the low-resource, agglutinative language. This inventory includes 18 main relations and 26 subtypes, with specific labels such as cop:zero for verbless clauses and instr:case=loc/dat for nuanced instrumental functions. To empirically validate the necessity of this tailored approach, we conducted a cross-standard evaluation using a pre-trained Universal Dependencies parser. The analysis revealed a systematic 47.9% divergence in annotations, pinpointing the inadequacy of universal schemes for handling Uyghur-specific structures. Grounded in nine annotation principles that ensure typological accuracy and semantic transparency, the Modern Uyghur Dependency Treebank (MUDT) provides a more accurate and semantically transparent representation, designed to enable significant improvements in parsing and downstream NLP tasks, and offers a replicable model for other morphologically complex languages.
Similar Papers
Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement
Computation and Language
Makes computers understand Korean learned by non-native speakers.
The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
Computation and Language
Helps computers understand the Tagalog language better.
UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
Computation and Language
Helps computers understand Korean grammar better.