Performance Consistency of Learning Methods for Information Retrieval Tasks

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical issue of model performance inconsistency in information retrieval (IR). It reveals substantial training instability in Transformer-based models across different random seeds, contrasting with the greater robustness of traditional statistical models. Using multi-seed experiments and bootstrap resampling, the study systematically evaluates the standard deviations of F1-score and precision on three IR tasks, yielding 11 model–task cases. The F1-score standard deviation exceeds 0.075 in 9 of the 11 cases, and the precision standard deviation exceeds 0.125 in 7 of 11, demonstrating far higher variability than commonly assumed in a field where differences below 0.02 are routinely cited as evidence of improvement. The paper quantifies this source of result unreliability in IR and advocates a more rigorous evaluation paradigm incorporating multi-seed validation and explicit stability metrics, thereby establishing a methodological foundation for reproducible and trustworthy IR research.

📝 Abstract
A range of approaches have been proposed for estimating the accuracy or robustness of the measured performance of IR methods. One is to use bootstrapping of test sets, which, as we confirm, provides an estimate of variation in performance. For IR methods that rely on a seed, such as those that involve machine learning, another approach is to use a random set of seeds to examine performance variation. On three different IR tasks, we have used such randomness to examine a range of traditional statistical learning models and transformer-based learning models. While the statistical models are stable, the transformer models show huge variation as seeds are changed. In 9 of 11 cases the F1-scores (in the range 0.0--1.0) had a standard deviation of over 0.075, while 7 of 11 precision values (also in the range 0.0--1.0) had a standard deviation of over 0.125. This is in a context where differences of less than 0.02 have been used as evidence of method improvement. Our findings highlight the vulnerability of transformer models to training instabilities, and moreover raise questions about the reliability of previous results, underscoring the need for rigorous evaluation practices.
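The bootstrap approach described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it resamples a fixed set of test predictions with replacement and reports the standard deviation of F1 across resamples (all function names here are hypothetical).

```python
import random

def f1_score(y_true, y_pred):
    """Binary F1 computed from paired 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_std(y_true, y_pred, n_resamples=1000, seed=0):
    """Std. dev. of F1 over bootstrap resamples of one test set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Sample n indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return var ** 0.5
```

Note that this estimates variation due to the test sample only; it cannot see run-to-run training instability, which is why the paper pairs it with multi-seed experiments.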
Problem

Research questions and friction points this paper is trying to address.

Evaluating performance consistency of IR methods using different randomness sources
Comparing stability between statistical models and transformer-based learning models
Highlighting transformer models' vulnerability to training instabilities and reliability concerns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used bootstrapping to estimate IR method performance variation
Applied random seeds to examine machine learning stability
Compared statistical models with transformer-based models
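The multi-seed protocol amounts to repeating the full train-and-evaluate cycle under different seeds and reporting the spread. A minimal sketch, with `train_and_eval` as a hypothetical stand-in (here a simulated stochastic score) for a real training run whose randomness is controlled by the seed:

```python
import random
import statistics

def train_and_eval(seed):
    """Hypothetical stand-in for one training run returning test F1.
    In practice the seed would control all sources of randomness
    (weight init, data shuffling, dropout) and evaluation would use
    a fixed held-out test set."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.08)  # simulated run-to-run variation

def multi_seed_report(seeds):
    """Mean and standard deviation of F1 across independent seeds."""
    scores = [train_and_eval(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

mean_f1, std_f1 = multi_seed_report(range(10))
print(f"F1 = {mean_f1:.3f} +/- {std_f1:.3f}")
```

Reporting the standard deviation alongside the mean, as the paper advocates, makes it immediately visible whether a claimed improvement (e.g. 0.02 in F1) exceeds the method's own seed-to-seed variation.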