Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5

📅 2024-09-09
🏛️ North American Chapter of the Association for Computational Linguistics
🤖 AI Summary
To address the lack of authoritative evaluation benchmarks and efficient zero-shot models for Hindi information retrieval, this paper introduces Hindi-BEIR, the first comprehensive Hindi retrieval benchmark, covering 15 datasets across 7 task categories. It further proposes NLLB-E5, a zero-shot multilingual retrieval model distilled from the NLLB encoder, which requires no Hindi-labeled data and integrates multilingual embedding alignment with the E5 retrieval paradigm. Experimental results show that NLLB-E5 achieves a 12.3% average improvement in NDCG@10 over prior methods on Hindi-BEIR, enabling, for the first time, high-performance, out-of-the-box zero-shot Hindi retrieval. This work breaks the long-standing dependency of low-resource language retrieval on target-language supervision, systematically characterizes performance bottlenecks across domains and tasks, and establishes both a new benchmark and a novel paradigm for multilingual retrieval research.
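For readers unfamiliar with the NDCG@10 metric cited in the summary, the following is a minimal sketch of how it can be computed for a single query. It is not the paper's evaluation code: the function names, the toy document ids, and the choice of linear gain (relevance divided by a log2 rank discount) are assumptions made for illustration.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values.

    Assumption: linear gain, i.e. rel / log2(rank + 1) with 1-based ranks.
    """
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict mapping document id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy query: two relevant documents, retrieved at ranks 1 and 4.
qrels = {"d1": 1, "d7": 1}
ranking = ["d1", "d3", "d4", "d7"]
score = ndcg_at_k(ranking, qrels)
```

A benchmark-level figure such as the one quoted above would be obtained by averaging this per-query score over all queries in a dataset, and then over datasets.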

📝 Abstract
Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.
Problem

Research questions and friction points this paper is trying to address.

Lack of Hindi retrieval benchmarks for evaluation
Task- and domain-specific challenges that limit Hindi retrieval performance
Need for zero-shot Hindi retrieval without Hindi training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the Hindi-BEIR benchmark (15 datasets across seven tasks) for evaluation
Developed the NLLB-E5 multilingual retrieval model
Zero-shot approach that supports Hindi without Hindi training data
Arkadeep Acharya
Department of Computer Science and Engineering, Indian Institute of Technology Patna
Rudra Murthy
Staff Research Scientist, IBM
Natural Language Processing, Deep Learning
Vishwajeet Kumar
IBM Research
Jaydeep Sen
IBM Research AI
Question Answering, Information Retrieval, NLP