Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Published: October 8, 2025 | arXiv ID: 2510.07000v1

By: Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev and more

Potential Business Impact:

Makes AI understand Indian languages and cultures better.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets, Pragyaan-IT (22.5K) and Pragyaan-Align (100K), across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.


Page Count
37 pages

Category
Computer Science:
Computation and Language