🤖 AI Summary
This work addresses the need for query-level, label-free estimation of ranking quality, since retrieval effectiveness varies substantially across queries. It studies whether a large language model (LLM) reranker can evaluate the ranking it has just produced, and examines two training-free approaches: metric-specific self-consistency across sampled rankings, and verbalized confidence produced directly by the reranker. Because direct verbalized confidence proves severely overconfident, the study further proposes two supervised methods, Verb-Num and Verb-List, which let the reranker produce calibrated ranking-quality estimates with only a few additional output tokens. Experiments on the TREC Deep Learning 2019–2022 benchmarks with four LLMs show that self-consistency is competitive with the state-of-the-art approach and better calibrated in almost all settings.
📝 Abstract
Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019–2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.
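The abstract does not spell out how self-consistency is computed, so the following is only an illustrative sketch of one way agreement across sampled rankings could be turned into a quality estimate. It assumes that each sampled ranking can serve in turn as a pseudo-reference, that documents in the reference top-k receive rank-decaying pseudo-gains, and that an nDCG-style score is averaged over all ordered pairs of samples. The function names `pseudo_ndcg` and `self_consistency`, the gain scheme, and the cutoff `k` are hypothetical choices for illustration, not the paper's definitions.

```python
from itertools import permutations
from math import log2


def pseudo_ndcg(reference, candidate, k=10):
    """nDCG@k-style agreement of `candidate` with `reference`.

    Assumption (not from the paper): the document at 0-based rank i in the
    reference top-k gets pseudo-gain k - i; all other documents get gain 0.
    """
    gains = {doc: k - i for i, doc in enumerate(reference[:k])}
    # Discounted gain of the candidate ranking under the pseudo-labels.
    dcg = sum(gains.get(doc, 0) / log2(i + 2)
              for i, doc in enumerate(candidate[:k]))
    # Ideal value: the pseudo-gains placed in descending order.
    ideal = sum(g / log2(i + 2)
                for i, g in enumerate(sorted(gains.values(), reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0


def self_consistency(rankings, k=10):
    """Mean pairwise agreement over all ordered pairs of sampled rankings."""
    pairs = list(permutations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two sampled rankings")
    return sum(pseudo_ndcg(ref, cand, k) for ref, cand in pairs) / len(pairs)


if __name__ == "__main__":
    # Three hypothetical rankings sampled from a reranker for one query.
    samples = [
        ["d1", "d2", "d3", "d4"],
        ["d2", "d1", "d3", "d4"],
        ["d1", "d2", "d4", "d3"],
    ]
    print(f"estimated ranking quality: {self_consistency(samples, k=3):.3f}")
```

Under these assumptions, samples that agree on the top of the list yield a score near 1, while samples that share no top-k documents yield 0, so the score can be read as a label-free proxy for how stable, and plausibly how reliable, the reranker's output is for that query.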