Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM

📅 2025-09-02
🤖 AI Summary
This paper addresses the irreproducibility of BM25 retrieval results on the BRIGHT benchmark, caused by its nonstandard query-side term weighting, in contrast to the conventional bag-of-words treatment of queries. We systematically analyze how this implementation discrepancy affects reasoning-intensive queries and retrieval-augmented generation (RAG) scenarios. To resolve this, we formalize "query-side BM25," integrate it uniformly into the Anserini/Pyserini and RankLLM toolchains, and establish a fully reproducible end-to-end baseline encompassing sparse, dense, hybrid, and LLM-driven listwise reranking. Our contributions are threefold: (1) the first identification and clarification of the root cause of the BM25 score discrepancy on BRIGHT; (2) a standardized, cross-framework implementation of query-side BM25; and (3) a multi-paradigm retrieval baseline explicitly suited to long queries, improving retrieval consistency and practical utility in emerging applications such as RAG.

📝 Abstract
The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries over diverse domains. We explore retrieval results on BRIGHT using a range of retrieval techniques, including sparse, dense, and fusion methods, and establish reproducible baselines. We then apply listwise reranking with large language models (LLMs) to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM, with two-click reproducibility that makes them easy to build upon and convenient for further development. While attempting to reproduce the results reported in the original BRIGHT paper, we find that the provided BM25 scores differ notably from those that we obtain using Anserini and Pyserini. We discover that this difference is due to BRIGHT's implementation of BM25, which applies BM25 on the query rather than using the standard bag-of-words approach, as in Anserini, to construct query vectors. This difference has become increasingly relevant due to the rise of longer queries, with BRIGHT's lengthy reasoning-intensive queries being a prime example, and further accentuated by the increasing usage of retrieval-augmented generation, where LLM prompts can grow to be much longer than "traditional" search engine queries. Our observation signifies that it may be time to reconsider BM25 approaches going forward in order to better accommodate emerging applications. To facilitate this, we integrate query-side BM25 into both Anserini and Pyserini.
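The distinction the abstract draws can be made concrete with a small sketch. In standard bag-of-words BM25 (as in Anserini), a repeated query term simply contributes its raw count; a query-side variant instead saturates repeated query terms with a BM25-style weighting, which matters for long queries where terms repeat often. The saturation form below (the `k3` component of classic Okapi BM25) and all parameter values are illustrative assumptions, not BRIGHT's exact implementation:

```python
import math
from collections import Counter

# Toy corpus; whitespace tokenization for illustration only.
docs = [
    "sparse retrieval with bm25".split(),
    "dense retrieval with embeddings".split(),
    "bm25 scoring and term weighting".split(),
]

def idf(term, docs):
    # Lucene-style smoothed IDF.
    n = len(docs)
    df = sum(term in d for d in docs)
    return math.log((n - df + 0.5) / (df + 0.5) + 1)

def bm25_score(query_tokens, doc, docs, k1=0.9, b=0.4, k3=None):
    """Score one document against a query.

    k3=None: standard bag-of-words scoring, where a query term repeated
    q times contributes q times its per-occurrence weight.
    k3 set: repeated query terms are saturated, qtf*(k3+1)/(k3+qtf),
    one illustrative form of query-side weighting.
    """
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf_d = Counter(doc)
    score = 0.0
    for term, qtf in Counter(query_tokens).items():
        tf = tf_d[term]
        if tf == 0:
            continue
        doc_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        q_weight = qtf if k3 is None else qtf * (k3 + 1) / (k3 + qtf)
        score += idf(term, docs) * doc_part * q_weight
    return score

# A long query with repeated terms, mimicking reasoning-intensive queries.
query = "bm25 bm25 bm25 retrieval".split()
standard = [bm25_score(query, d, docs) for d in docs]
query_side = [bm25_score(query, d, docs, k3=8.0) for d in docs]
```

On the toy data, the two variants assign different scores (and potentially different rankings) exactly for documents matching the repeated term, which is why reproduced BM25 numbers diverge once queries grow long.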
Problem

Research questions and friction points this paper is trying to address.

Establishing reproducible baselines for reasoning-intensive queries
Investigating the impact of LLM reranking on complex queries
Addressing BM25 implementation differences for long queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using sparse, dense, and fusion retrieval methods
Applying listwise reranking with large language models
Integrating query-side BM25 into toolkits
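To illustrate the fusion idea in the list above: a common way to combine sparse and dense runs is reciprocal rank fusion (RRF), which scores each document by the sum of reciprocal ranks across runs. Whether the paper uses RRF specifically is an assumption here; this is a generic sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids (each ordered best-first).

    Each list contributes 1 / (k + rank) per document; k=60 is the
    conventional RRF constant from the original formulation.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_run = ["d1", "d2", "d3"]
dense_run = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_run, dense_run])
```

Documents ranked highly by both runs (here `d1` and `d3`) rise to the top of the fused list, which is why rank-based fusion is a robust baseline when sparse and dense scores are not directly comparable.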