NeuCLIRBench: A Modern Evaluation Collection for Monolingual, Cross-Language, and Multilingual Information Retrieval

📅 2025-11-18
🤖 AI Summary
This study addresses the lack of modern test collections with strong statistical discriminative power for monolingual, cross-language, and multilingual information retrieval (IR). To this end, the authors construct a multilingual test collection of documents written natively in Chinese, Persian, and Russian, together with English machine translations of those documents; relevance judgments are made by human assessors. The collection combines three years of TREC NeuCLIR track topics and, notably, ships with a fusion baseline of strong neural retrieval systems for first-stage retrieval, so that developers of reranking algorithms are no longer reliant on BM25. The resulting dataset comprises approximately 150 queries and over 250,000 high-quality relevance judgments. All data and baseline runs are publicly released, establishing a fair, statistically robust evaluation benchmark for multilingual IR systems.

๐Ÿ“ Abstract
To measure advances in retrieval, test collections with relevance judgments that can faithfully distinguish systems are required. This paper presents NeuCLIRBench, an evaluation collection for cross-language and multilingual retrieval. The collection consists of documents written natively in Chinese, Persian, and Russian, as well as those same documents machine translated into English. The collection supports several retrieval scenarios including: monolingual retrieval in English, Chinese, Persian, or Russian; cross-language retrieval with English as the query language and one of the other three languages as the document language; and multilingual retrieval, again with English as the query language and relevant documents in all three languages. NeuCLIRBench combines the TREC NeuCLIR track topics of 2022, 2023, and 2024. The 250,128 judgments across approximately 150 queries for the monolingual and cross-language tasks and 100 queries for multilingual retrieval provide strong statistical discriminatory power to distinguish retrieval approaches. A fusion baseline of strong neural retrieval systems is included with the collection so that developers of reranking algorithms are no longer reliant on BM25 as their first-stage retriever. NeuCLIRBench is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-language and multilingual information retrieval systems
Providing test collections with relevance judgments for system distinction
Supporting monolingual, cross-language, and multilingual retrieval scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native documents in Chinese, Persian, and Russian
Machine translated documents into English
Fusion baseline of neural retrieval systems
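The fusion baseline combines the ranked lists of several neural retrievers into a single first-stage ranking. The paper's exact fusion method is not specified in this summary; a common, simple choice is Reciprocal Rank Fusion (RRF), sketched below with hypothetical run data:

```python
# Reciprocal Rank Fusion (RRF): a standard way to fuse ranked lists from
# multiple retrieval systems. Illustrative sketch only -- the collection's
# actual fusion baseline may differ.
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists in which it
    appears; k=60 is the constant used in the original RRF formulation.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse runs from two hypothetical neural retrievers
dense_run = ["d3", "d1", "d7"]
sparse_run = ["d1", "d7", "d9"]
fused = rrf_fuse([dense_run, sparse_run])
```

Documents retrieved highly by several systems rise to the top of the fused list, which is why fusing strong neural runs gives a more competitive first stage than BM25 alone.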