IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the longstanding absence of evaluation benchmarks and high-quality multilingual training data for Retrieval-Augmented Generation (RAG) systems targeting Indian languages, this work introduces the first large-scale RAG resource suite for Indian languages. Methodologically, the authors (1) construct IndicMSMarco, an end-to-end RAG benchmark covering 13 Indian languages built from 1,000 manually translated queries; and (2) release a large multilingual training dataset that combines (question, answer, relevant passage) triplets derived from the Wikipedias of 19 Indian languages with translation-augmented MS MARCO passages. The construction pipeline pairs human expert translation with LLM-assisted extraction and cross-lingual alignment to produce high-fidelity question-answer-passage triplets. Empirical results show gains in retrieval accuracy and generation faithfulness, particularly for low-resource Indian languages. All resources are publicly released on Hugging Face.

📝 Abstract
Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, created via manual translation of 1000 diverse queries from MS MARCO-dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: https://huggingface.co/collections/ai4bharat/indicragsuite-683e7273cb2337208c8c0fcb
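The training-data construction described in the abstract (deriving (question, answer, relevant passage) tuples from Wikipedia text with an LLM) can be sketched roughly as follows. This is an illustrative outline only, not the paper's actual pipeline: `extract_qa` is a trivial stub standing in for the LLM that proposes question-answer pairs, and all names are hypothetical.

```python
# Sketch of building (question, answer, relevant passage) triplets from
# raw passages. In the paper, a state-of-the-art LLM generates the QA
# pairs for each Wikipedia passage; here a stub stands in for that call.

def extract_qa(passage: str) -> list[tuple[str, str]]:
    """Stub standing in for an LLM that proposes QA pairs for a passage."""
    # Toy heuristic for illustration only: treat the first sentence as
    # the answer to a generic question about the passage.
    first = passage.split(".")[0].strip()
    return [("What does this passage state?", first)] if first else []

def build_triplets(passages: list[str]) -> list[dict]:
    """Assemble retrieval-training triplets from a passage collection."""
    triplets = []
    for passage in passages:
        for question, answer in extract_qa(passage):
            triplets.append(
                {"question": question, "answer": answer, "passage": passage}
            )
    return triplets

corpus = [
    "Chennai is the capital of Tamil Nadu. It lies on the Coromandel Coast.",
    "The Ganges is a major river of the Indian subcontinent.",
]
data = build_triplets(corpus)
```

Swapping the stub for a real LLM call (plus a filtering pass to drop low-quality pairs) gives the general shape of LLM-assisted triplet mining over a monolingual Wikipedia dump.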
Problem

Research questions and friction points this paper is trying to address.

Lack of evaluation benchmarks for Indian language RAG systems
Missing large-scale training datasets for multilingual retrieval
Limited resources for extending RAG to diverse Indian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created IndicMSMarco benchmark for 13 Indian languages
Built large-scale dataset from 19 Indian Wikipedias
Included translated MS MARCO for training enrichment
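A benchmark of this kind is typically consumed by scoring a retriever with recall@k over (query, relevant passage) pairs. The sketch below shows that evaluation loop; the word-overlap scorer is an illustrative stand-in for a real multilingual retriever, and the data is invented, not drawn from IndicMSMarco.

```python
# Illustrative recall@k evaluation over (query, gold passage id) pairs,
# the kind of retrieval metric a benchmark like IndicMSMarco supports.
# overlap_score is a toy stand-in for a trained multilingual retriever.

def overlap_score(query: str, passage: str) -> int:
    """Count shared lowercase tokens between query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def recall_at_k(queries, gold_ids, corpus, k=2):
    """Fraction of queries whose gold passage ranks in the top k."""
    hits = 0
    for query, gold in zip(queries, gold_ids):
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: overlap_score(query, corpus[i]),
            reverse=True,
        )
        hits += gold in ranked[:k]
    return hits / len(queries)

corpus = [
    "Mumbai is the financial capital of India.",
    "Hampi was the capital of the Vijayanagara Empire.",
    "The Brahmaputra flows through Assam.",
]
queries = ["capital of the Vijayanagara Empire", "river through Assam"]
gold_ids = [1, 2]
score = recall_at_k(queries, gold_ids, corpus, k=1)
```

Replacing `overlap_score` with dense-embedding similarity from a retriever under test turns this into a standard retrieval-quality evaluation.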
Pasunuti Prasanjith
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India
Prathmesh B More
Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, India
Anoop Kunchukuttan
Microsoft Translator, AI4Bharat
NLP · Multilingual Learning · Instruction Tuning · MT · Indian language NLP
Raj Dabre
Researcher@NICT (Japan), Adjunct Faculty@IIT Madras/AI4Bharat (India)
Artificial Intelligence · Machine Translation · Natural Language Processing · Genetics