INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the scarcity of evaluation benchmarks and annotated data for question answering (QA) in low-resource Indian languages, this paper introduces Indic-QA, the first large-scale, context-grounded multilingual QA benchmark covering 11 Indian languages. It combines human-translated QA pairs with extractive and generative pairs synthesized by Gemini and verified by human annotators. The authors propose an "LLM-synthesis, human-refinement" data construction paradigm and establish a unified zero-shot and few-shot evaluation framework. Their systematic evaluation of mainstream multilingual LLMs reveals substantially lower performance on Indian languages than on English, with the gap most pronounced for low-resource varieties. Indic-QA fills a critical gap in non-English, context-driven QA evaluation, providing a reproducible, high-quality benchmark and a methodological foundation for low-resource language NLP research.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable zero-shot and few-shot capabilities on unseen tasks, including context-grounded question answering (QA) in English. However, evaluation of LLMs' context-based QA capabilities in other languages is limited by the scarcity of non-English benchmarks. To address this gap, we introduce Indic-QA, the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families. The dataset comprises both extractive and abstractive question-answering tasks and includes existing datasets as well as English QA datasets translated into Indian languages. Additionally, we generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance. We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages. We hope that the release of this dataset will stimulate further research on the question-answering abilities of LLMs for low-resource languages.
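The evaluation setup described above, zero-shot context-grounded QA scored against gold answers, can be sketched minimally. The prompt template and the token-level F1 metric below are illustrative assumptions on my part (token F1 is the standard SQuAD-style metric for extractive QA), not the paper's documented prompt or scoring code.

```python
from collections import Counter

def build_prompt(context: str, question: str) -> str:
    # Zero-shot, context-grounded QA prompt (illustrative template;
    # the exact prompt used for Indic-QA is not specified here).
    return (
        "Answer the question using only the passage below.\n\n"
        f"Passage: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def token_f1(prediction: str, gold: str) -> float:
    # SQuAD-style token-overlap F1, a common metric for extractive QA.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In a full evaluation loop, `build_prompt` would be sent to each multilingual LLM under test and `token_f1` averaged over the benchmark's QA pairs per language, which is how per-language gaps against English would surface.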
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in Indic languages
Addressing English bias in LLMs
Improving QA in low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Benchmark for Indic Languages
Translate-Test Paradigm for Low-Resource Languages
Evaluation of Instruction-Finetuned LLMs