🤖 AI Summary
Current retrieval evaluation relies on heuristic query sets that introduce implicit biases and fail to reflect true system performance. This work reframes evaluation as a statistical estimation problem and proposes the first semantics-aware evaluation framework grounded in semantic stratification. By clustering entities to construct an interpretable document semantic space, the method systematically generates queries that cover missing semantic strata, enabling comprehensive and transparent assessment. The framework provides formal guarantees on semantic coverage and uncovers structural causes of retrieval failures, thereby transcending the limitations of conventional average-based metrics. Empirical results across multiple benchmarks demonstrate that the approach effectively identifies systematic coverage gaps, significantly enhancing evaluation stability and the reliability of downstream decisions.
📝 Abstract
Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes.
Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.