🤖 AI Summary
Existing open-source multilingual datasets suffer from three critical limitations for Indian languages: insufficient language coverage, weak cultural adaptation, and low task diversity. To address these, we propose a human-in-the-loop data construction paradigm that integrates expert translation, synthetic generation, and multi-source aggregation, with emphasis on task diversity (13 coarse-grained and 56 fine-grained categories), multi-turn dialogue, instruction fidelity, safety alignment, and culturally nuanced expression. Using this approach, we construct and publicly release two high-quality Indian-language datasets: Pragyaan-IT (22.5K samples, optimized for instruction tuning) and Pragyaan-Align (100K samples, designed for preference alignment), covering ten major Indian languages. Empirical evaluation shows that models trained on our data achieve substantial gains in both linguistic performance and cultural appropriateness across Indian languages. These datasets provide foundational infrastructure for inclusive, culturally aware multilingual AI systems.
📝 Abstract
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task-diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translation with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets, Pragyaan-IT (22.5K samples) and Pragyaan-Align (100K samples), spanning 10 Indian languages, covering 13 broad and 56 sub-categories, and drawing on 57 diverse source datasets. Our dataset protocol incorporates several often-overlooked dimensions, emphasizing task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
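To make the pipeline shape concrete, below is a minimal sketch of how such a human-in-the-loop loop could be organized: aggregated seed data is translated into target languages, synthetically expanded, and passed through a human review gate before acceptance. This is an illustrative assumption, not the authors' actual tooling; every name here (`translate_record`, `expand_record`, `human_review`) is a hypothetical stand-in, and the review step is reduced to trivial checks where real annotators would judge fidelity, safety, and cultural appropriateness.

```python
# Hypothetical sketch of a human-in-the-loop curation loop in the spirit of
# the pipeline described above. Not the authors' implementation.
from dataclasses import dataclass, field


@dataclass
class Record:
    prompt: str
    response: str
    language: str
    source: str                      # e.g. "aggregated", "translated", "synthetic"
    flags: list[str] = field(default_factory=list)


def translate_record(rec: Record, target_lang: str) -> Record:
    """Stand-in for the expert-translation step into an Indic language."""
    return Record(prompt=f"[{target_lang}] {rec.prompt}",
                  response=f"[{target_lang}] {rec.response}",
                  language=target_lang, source="translated")


def expand_record(rec: Record) -> list[Record]:
    """Stand-in for synthetic expansion (paraphrases, extra turns, etc.)."""
    variant = Record(prompt=rec.prompt + " (variant)",
                     response=rec.response,
                     language=rec.language, source="synthetic")
    return [rec, variant]


def human_review(rec: Record) -> bool:
    """Stand-in for the human gate; real annotators would check instruction
    fidelity, safety, and cultural appropriateness. Here: trivial checks."""
    return bool(rec.prompt.strip()) and bool(rec.response.strip())


def curate(seed: list[Record], target_langs: list[str]) -> list[Record]:
    """Translate, expand, then keep only records that pass human review."""
    accepted = []
    for rec in seed:
        for lang in target_langs:
            for candidate in expand_record(translate_record(rec, lang)):
                if human_review(candidate):
                    accepted.append(candidate)
                else:
                    candidate.flags.append("rejected")
    return accepted


if __name__ == "__main__":
    seed = [Record("Explain the festival of Pongal.",
                   "Pongal is a harvest festival...",
                   "en", "aggregated")]
    data = curate(seed, ["ta", "hi"])
    print(f"{len(data)} accepted records")
```

The key design point the sketch tries to capture is that humans sit inside the loop as an accept/reject gate on every candidate, rather than reviewing only a post-hoc sample, which is one plausible reading of the "human-in-the-loop" framing above.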