🤖 AI Summary
Existing retrieval benchmarks focus predominantly on information-seeking queries and rely on keyword matching or shallow semantic similarity, so they fail to adequately evaluate models' ability to retrieve relevant documents for complex, reasoning-intensive queries, such as those requiring comprehension of code logic or cross-disciplinary conceptual derivation. To address this gap, we introduce BRIGHT, the first benchmark explicitly designed for *reasoning-intensive retrieval*, comprising 1,384 real-world, complex queries spanning economics, psychology, mathematics, and programming. We formally define this paradigm and empirically validate its distinctiveness. Experimental results show that state-of-the-art retrieval models achieve only 18.3 nDCG@10 on BRIGHT, a 40.7-point drop from the leading model's MTEB score of 59.0, demonstrating a critical capability gap. Incorporating explicit reasoning about the query lifts performance to 30.5 (+12.2 points), and feeding the retrieved documents into retrieval-augmented generation (RAG) further improves downstream question-answering accuracy by over 6.6 points.
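For reference, nDCG@10 (the metric quoted throughout) discounts each retrieved document's relevance gain by the log of its rank and normalizes by the best achievable ranking. A minimal sketch of the standard formula in Python (this is the conventional definition, not BRIGHT's own evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system's ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: one relevant document placed at rank 3 instead of rank 1
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```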
📝 Abstract
Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword-based or semantic retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning beyond surface-form matching to identify relevant documents. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding, drawn from naturally occurring and carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT: the leading model on the MTEB leaderboard (Muennighoff et al., 2023), which achieves an nDCG@10 of 59.0 there, scores only 18.3 on BRIGHT. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points. Moreover, incorporating retrieved documents from the top-performing retriever boosts question-answering performance by over 6.6 points. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings.
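The query-reasoning gain can be pictured as a two-step pipeline: an LLM first expands the query into the concepts a relevant document should contain, and the dense retriever then embeds that expansion rather than the raw query. Below is a rough sketch under assumed components; `reason_about_query` is a hypothetical stand-in for the LLM call, and the sentence-transformers model is an arbitrary choice for illustration, not the paper's exact setup:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any dense retriever would do here

def reason_about_query(query: str) -> str:
    # Placeholder for an LLM call that spells out what a relevant document would
    # discuss (e.g., "think step by step about the underlying concepts").
    # Hard-coded here so the sketch runs without an API key.
    return query + " A relevant document would explain compound interest, i.e., exponential growth of principal."

def retrieve(query: str, corpus: list[str], k: int = 10, use_reasoning: bool = True) -> list[str]:
    text = reason_about_query(query) if use_reasoning else query
    q_emb = model.encode([text], normalize_embeddings=True)
    d_embs = model.encode(corpus, normalize_embeddings=True)  # precompute in practice
    scores = (q_emb @ d_embs.T)[0]  # cosine similarity on normalized vectors
    return [corpus[i] for i in np.argsort(-scores)[:k]]
```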