Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

📅 2025-10-08
🤖 AI Summary
Existing open-source multilingual datasets suffer from three critical limitations for Indian languages: insufficient language coverage, weak cultural adaptation, and low task diversity. To address these, we propose a human-in-the-loop data construction paradigm integrating expert translation, synthetic generation, and multi-source aggregation—emphasizing task diversity (13 coarse-grained and 56 fine-grained categories), multi-turn dialogue, instruction fidelity, safety alignment, and culturally nuanced expressions. Leveraging this approach, we construct and publicly release two high-quality Indian-language datasets: Pragyaan-IT (22.5K samples, optimized for instruction tuning) and Pragyaan-Align (100K samples, designed for preference alignment), covering ten major Indian languages. Empirical evaluation demonstrates that models trained on our data achieve substantial gains in both linguistic performance and cultural appropriateness across Indian languages. These datasets constitute foundational infrastructure for inclusive, culturally aware multilingual AI systems.

📝 Abstract
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task-diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translation with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K samples) and Pragyaan-Align (100K samples) across 10 Indian languages, covering 13 broad and 56 sub-categories and leveraging 57 diverse source datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
Problem


Addressing multilingual data gaps for Indian languages
Creating culturally grounded post-training datasets for LLMs
Improving task diversity and cultural nuance in datasets
Innovation


Human-in-the-loop pipeline for data creation
Combines translation with synthetic data expansion
Emphasizes cultural nuance and task diversity
Neel Prabhanjan Rachamalla
Krutrim AI, Bangalore, India
Aravind Konakalla
Krutrim AI, Bangalore, India
Gautam Rajeev
Krutrim AI, Bangalore, India
Ashish Kulkarni
Krutrim
Chandra Khatri
Ola Krutrim AI
Shubham Agarwal
Krutrim AI, Bangalore, India