How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models

📅 2025-08-22
🤖 AI Summary
This study systematically evaluates the generalization capability of large language model (LLM)-based rerankers in information retrieval, focusing on performance disparities, and their underlying causes, between LLM-based and lightweight contextual or zero-shot methods on seen versus unseen queries. Method: through controlled ablation experiments across TREC DL19/DL20, BEIR, and a newly constructed unseen-query benchmark, we compare 22 reranking approaches (40 variants in total), quantifying for the first time the independent effects of training-data overlap, model architecture, and computational efficiency on reranking performance. Results: LLM rerankers perform strongly on familiar queries but generalize inconsistently to novel ones. Notably, several lightweight models achieve superior efficiency and robustness, even outperforming LLMs on unseen queries. The key contribution is identifying query novelty as a critical bottleneck for current reranking generalization, providing empirical evidence and methodological guidance for designing efficient, generalizable rerankers.
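The core operation being compared across all 22 approaches is pointwise reranking: score each (query, passage) pair and reorder by score. A minimal sketch of that loop, using a toy term-overlap scorer as a stand-in for the real models (an LLM prompt or a lightweight cross-encoder; the function names here are illustrative, not from the paper's codebase):

```python
# Illustrative pointwise reranking: score each (query, passage) pair, sort by score.
# overlap_score is a toy stand-in for a real reranker model.

def overlap_score(query: str, passage: str) -> float:
    # Fraction of query terms that appear in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query, passages, scorer=overlap_score):
    # Higher score first; Python's stable sort keeps the
    # first-stage retrieval order for tied passages.
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)

docs = ["cats sleep a lot",
        "how rerankers order retrieved passages",
        "weather today"]
print(rerank("how do rerankers work", docs)[0])
# → "how rerankers order retrieved passages"
```

Swapping `scorer` is the axis of variation the study examines: the interface stays fixed while the scoring model ranges from lightweight contextual methods to full LLMs.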

📝 Abstract
In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate 22 methods in total, comprising 40 variants (depending on the underlying LLM), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training-data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their ability to generalize to novel queries varies, with lightweight models offering comparable efficiency. We further identify that query novelty significantly impacts reranking effectiveness, highlighting limitations in existing approaches. https://github.com/DataScienceUIBK/llm-reranking-generalization-study
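The benchmarks named above (TREC DL19/DL20, BEIR) are conventionally scored with nDCG@10, the standard reranking metric. A minimal sketch of that computation, assuming graded relevance labels as in TREC DL (function names are illustrative, not from the paper's codebase):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k ranked documents:
    # each relevance is discounted by log2(rank + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # nDCG@k: DCG of the system's ranking normalized by the ideal ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Graded labels (0-3) in the order a reranker returned the documents.
print(round(ndcg_at_k([3, 0, 2, 1], k=10), 4))
# → 0.9305
```

In practice TREC-style evaluations average this per-query score over the full query set; tools such as `pytrec_eval` implement the official variant.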
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based rerankers versus lightweight models
Assessing generalization to novel queries in retrieval
Analyzing training data overlap and efficiency impacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of LLM-based reranking methods
Comparison across lightweight and zero-shot approaches
Analysis of generalization on novel query datasets