Uncovering the Limitations of Query Performance Prediction: Failures, Insights, and Implications for Selective Query Processing

📅 2025-04-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Query Performance Prediction (QPP) suffers from poor generalization across retrieval paradigms and datasets, severely limiting its practical utility in downstream tasks such as selective query processing. This work presents the first systematic evaluation of mainstream QPP methods (LETOR-based features, NQC/UQC, and dense predictors) across diverse retrievers (BM25, DFree, SPLADE, ColBERT) and four major test collections (ROBUST, GOV2, WT10G, MS MARCO). Key findings reveal that collection bias is the primary cause of QPP failure; both sparse and dense predictors lack robustness across datasets and retrieval architectures; prediction accuracy degrades significantly on unseen collections, especially on MS MARCO; and QPP-guided selective processing yields less than 1% MAP improvement. These results expose fundamental methodological limitations of current QPP approaches and establish an empirical benchmark and concrete direction for designing generalizable QPP models.

📝 Abstract

Query Performance Prediction (QPP) estimates a retrieval system's effectiveness for a given query, offering valuable insights for search effectiveness and query processing. Despite extensive research, QPPs face critical challenges in generalizing across diverse retrieval paradigms and collections. This paper provides a comprehensive evaluation of state-of-the-art QPPs (e.g., NQC, UQC), LETOR-based features, and newly explored dense-based predictors. Using diverse sparse rankers (BM25 and DFree, with and without query expansion), hybrid or dense rankers (SPLADE and ColBERT), and diverse test collections (ROBUST, GOV2, WT10G, and MS MARCO), we investigate the relationships between predicted and actual performance, with a focus on generalization and robustness. Results show significant variability in predictor accuracy, with the collection as the main factor and the ranker next. Some sparse predictors perform to some extent on some collections (TREC ROBUST and GOV2) but do not generalize to others (WT10G and MS MARCO). While some predictors show promise in specific scenarios, their overall limitations constrain their utility for applications. We show that QPP-driven selective query processing offers only marginal gains, emphasizing the need for improved predictors that generalize across collections, align with dense retrieval architectures, and are useful for downstream applications.
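
To make "the relationship between predicted and actual performance" concrete, here is a minimal Python sketch of how a post-retrieval predictor such as NQC is commonly scored and then correlated with per-query effectiveness. It is an illustration under stated assumptions, not the paper's implementation: NQC is taken in its usual form (standard deviation of the top-k retrieval scores divided by the absolute corpus score), the function names `nqc` and `pearson` are hypothetical, and every number in the toy data is made up for demonstration.

```python
import math


def nqc(top_k_scores, corpus_score):
    """NQC sketch: std. dev. of the top-k retrieval scores, normalized by |corpus score|.

    `corpus_score` stands for the score the ranker would assign to the whole
    collection treated as one document (an assumed input here).
    """
    k = len(top_k_scores)
    mean = sum(top_k_scores) / k
    variance = sum((s - mean) ** 2 for s in top_k_scores) / k
    return math.sqrt(variance) / abs(corpus_score)


def pearson(xs, ys):
    """Pearson correlation between predicted values and actual effectiveness."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy inputs (invented values, NOT results from the paper).
top_k = {
    "q1": [12.0, 9.0, 6.0, 3.0],
    "q2": [7.0, 6.5, 6.0, 5.5],
    "q3": [10.0, 8.0, 6.0, 4.0],
}
corpus_scores = {"q1": 2.0, "q2": 2.0, "q3": 2.0}
actual_ap = {"q1": 0.60, "q2": 0.15, "q3": 0.40}  # hypothetical average precision

queries = sorted(top_k)
predicted = [nqc(top_k[q], corpus_scores[q]) for q in queries]
actual = [actual_ap[q] for q in queries]
print("Pearson r between NQC and AP:", pearson(predicted, actual))
```

A predictor that generalizes would keep a high correlation when the same procedure is repeated on another collection or with another ranker; the abstract reports that this is often not the case.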

Problem

Research questions and friction points this paper is trying to address.

Evaluates QPP generalization across diverse retrieval paradigms

Assesses QPP robustness with sparse and dense rankers

Identifies limitations in QPP-driven selective query processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates QPPs across diverse retrieval paradigms

Tests sparse and dense rankers on multiple collections

Highlights need for generalizable QPP predictors
