NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data
By: Agniva Maiti, Manya Pandey, Murari Mandal
Potential Business Impact:
Helps computers understand a rare language.
The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81\% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
Similar Papers
Dealing with the Hard Facts of Low-Resource African NLP
Computation and Language
Helps computers understand a rare language.
Exploring NLP Benchmarks in an Extremely Low-Resource Setting
Computation and Language
Helps computers understand rare languages better.
Part-of-speech tagging for Nagamese Language using CRF
Computation and Language
Helps computers understand a new language.