Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

📅 2025-10-08
🤖 AI Summary
Existing open-source multilingual datasets suffer from three critical limitations for Indian languages: insufficient language coverage, weak cultural adaptation, and low task diversity. To address these, we propose a human-in-the-loop data construction paradigm integrating expert translation, synthetic generation, and multi-source aggregation—emphasizing task diversity (13 coarse-grained and 56 fine-grained categories), multi-turn dialogue, instruction fidelity, safety alignment, and culturally nuanced expressions. Leveraging this approach, we construct and publicly release two high-quality Indian-language datasets: Pragyaan-IT (22.5K samples, optimized for instruction tuning) and Pragyaan-Align (100K samples, designed for preference alignment), covering ten major Indian languages. Empirical evaluation demonstrates that models trained on our data achieve substantial gains in both linguistic performance and cultural appropriateness across Indian languages. These datasets constitute foundational infrastructure for inclusive, culturally aware multilingual AI systems.

📝 Abstract
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task-diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translation with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K samples) and Pragyaan-Align (100K samples) across 10 Indian languages, covering 13 broad and 56 sub-categories and leveraging 57 diverse source datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
Problem


Addressing multilingual data gaps for Indian languages
Creating culturally grounded post-training datasets for LLMs
Improving task diversity and cultural nuance in datasets
Innovation


Human-in-the-loop pipeline for data creation
Combines translation with synthetic data expansion
Emphasizes cultural nuance and task diversity
Neel Prabhanjan Rachamalla
Krutrim AI, Bangalore, India
Aravind Konakalla
Krutrim AI, Bangalore, India
Gautam Rajeev
Krutrim AI, Bangalore, India
Ashish Kulkarni
Krutrim
Chandra Khatri
Ola Krutrim AI
Shubham Agarwal
Krutrim AI, Bangalore, India