Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
By: Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, and more
Potential Business Impact:
Helps AI understand Indian languages and cultures better.
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task-diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translation with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets, Pragyaan-IT (22.5K) and Pragyaan-Align (100K), across 10 Indian languages, covering 13 broad and 56 sub-categories and leveraging 57 diverse source datasets. Our dataset protocol incorporates several often-overlooked dimensions, emphasizing task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
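To make the pipeline concrete, here is a minimal sketch of the translate-then-expand-then-review flow the abstract describes. All names (`Example`, `translate`, `synthetic_expand`, `human_review`, `build_dataset`) are hypothetical illustrations, not the authors' actual code: in the real pipeline, translation would call an MT system, expansion would prompt an LLM, and review would be done by human annotators rather than the stand-in logic below.

```python
from dataclasses import dataclass

@dataclass
class Example:
    """One instruction-tuning example (hypothetical schema, not Pragyaan's)."""
    prompt: str
    response: str
    language: str
    approved: bool = False

def translate(ex: Example, target_lang: str) -> Example:
    # Stand-in: a real pipeline would call a machine-translation model here.
    return Example(f"[{target_lang}] {ex.prompt}",
                   f"[{target_lang}] {ex.response}", target_lang)

def synthetic_expand(ex: Example, n: int) -> list[Example]:
    # Stand-in: a real pipeline would prompt an LLM for paraphrases/variants.
    return [Example(f"{ex.prompt} (variant {i})", ex.response, ex.language)
            for i in range(n)]

def human_review(candidates: list[Example]) -> list[Example]:
    # Stand-in: annotators would check fluency, cultural fit, and safety;
    # here every candidate is approved so the sketch runs end to end.
    for ex in candidates:
        ex.approved = True
    return [ex for ex in candidates if ex.approved]

def build_dataset(seeds: list[Example], target_langs: list[str],
                  variants_per_seed: int = 2) -> list[Example]:
    out = []
    for seed in seeds:
        for lang in target_langs:
            translated = translate(seed, lang)
            candidates = [translated] + synthetic_expand(translated, variants_per_seed)
            out.extend(human_review(candidates))
    return out

seeds = [Example("Explain the festival of Onam.",
                 "Onam is a harvest festival...", "en")]
dataset = build_dataset(seeds, ["hi", "te"])
print(len(dataset))  # 2 languages x (1 translation + 2 variants) = 6
```

The point of the sketch is the ordering: expansion happens after translation, so synthetic variants are generated in the target language, and every candidate (translated or synthetic) passes through the same human gate before entering the dataset.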
Similar Papers
BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
Computation and Language
Creates better AI for languages with less data.
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Computation and Language
Helps computers understand Hindi and English better.