Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the long-standing misuse of the Wilcoxon signed-rank test in information retrieval (IR) evaluation, which has led to inflated Type I error rates and unreliable statistical inferences. Through a systematic literature review, theoretical analysis, and empirical validation using TREC data, the authors trace this misapplication to a common misconception: that the Wilcoxon test is a universally safe nonparametric alternative to the t-test. They argue that it is not, and that it is particularly ill-suited to typical IR experimental settings. The work clarifies that the test’s assumptions are violated by the inherent dependencies and discrete score distributions characteristic of IR evaluation. Consequently, the paper urges the IR community to abandon the Wilcoxon signed-rank test in favor of more appropriate statistical methods, to improve the rigor and reliability of experimental methodology.

📝 Abstract

In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.
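To make the Type I error claim concrete, here is a minimal simulation sketch. It is not the authors' experiment and uses no TREC data. The abstract does not name the mechanism behind the breakdown, so the sketch assumes one well-known possibility: per-topic score differences that are skewed but have a true mean of zero. The shifted exponential distribution, the 50 topics, the 1000 trials, and the function name `rejection_rates` are all illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy import stats


def rejection_rates(n_topics=50, n_trials=1000, alpha=0.05, seed=0):
    """Estimate how often each test rejects when the true mean difference is zero.

    Assumption (not from the paper): per-topic differences follow a
    right-skewed distribution, Exp(1) shifted to have mean exactly 0.
    """
    rng = np.random.default_rng(seed)
    wilcoxon_rejections = 0
    ttest_rejections = 0
    for _ in range(n_trials):
        # Simulated per-topic score differences between two systems.
        # Mean is 0, so any rejection of "no mean difference" is a false positive.
        diffs = rng.exponential(1.0, size=n_topics) - 1.0
        if stats.wilcoxon(diffs).pvalue < alpha:
            wilcoxon_rejections += 1
        if stats.ttest_1samp(diffs, 0.0).pvalue < alpha:
            ttest_rejections += 1
    return wilcoxon_rejections / n_trials, ttest_rejections / n_trials


wilcoxon_rate, ttest_rate = rejection_rates()
print(f"Wilcoxon signed-rank rejection rate: {wilcoxon_rate:.3f}")
print(f"Paired t-test rejection rate:        {ttest_rate:.3f}")
```

Under these assumptions, the signed-rank test should reject far more often than the nominal 5% level when its result is read as a statement about the mean difference, while the t-test should stay much closer to it. The exact rates depend on the chosen distribution and sample size, so the output is an illustration of the failure mode, not an estimate of its size in real IR benchmarks.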

Problem

Research questions and friction points this paper is trying to address.

Wilcoxon test

Information Retrieval

statistical misuse

Type I error

non-parametric test

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wilcoxon signed-rank test

Type I error

Information Retrieval evaluation