🤖 AI Summary
TF-IDF, a cornerstone heuristic in information retrieval, was conceived without a formal statistical foundation for assessing term significance.
Method: Framing a term's concentration in a document as a one-tailed Fisher's exact test, the paper shows that TF-ICF, a common TF-IDF variant, is closely related to the negative logarithm of the test's p-value under mild regularity conditions. As a corollary, TF-IDF is connected to the same negative log p-value under certain idealized assumptions, and that quantity converges to TF-IDF in the limit of an infinitely large document collection.
Contribution/Results: This work establishes a formal statistical link between TF-IDF, arguably the most celebrated term-weighting heuristic, and classical hypothesis testing. It gives TF-IDF an interpretable statistical meaning: the weight quantifies how significantly a term's concentration in a document deviates from what chance alone would produce. It also offers a principled basis for designing new term-weighting schemes grounded in statistical inference, connecting empirical IR practice with statistical theory.
📝 Abstract
Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher's exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and this negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher's exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme's long-established effectiveness.
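To make the comparison concrete, the following minimal sketch computes the three quantities side by side on a toy corpus. It is an illustration under stated assumptions, not the paper's procedure: the corpus, the function names, and the exact formulas are hypothetical choices. TF-IDF is taken as tf × log(D / df), TF-ICF as tf × log(N / cf) with N the total token count and cf the term's collection frequency, and the one-tailed Fisher's exact test $p$-value as the upper tail of a hypergeometric distribution over token counts. The paper's own definitions may differ in detail.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is a list of tokens.
DOCS = [
    "fisher test fisher exact test fisher".split(),
    "the test of the term".split(),
    "the term the document the term".split(),
    "the document of the collection".split(),
]

TERM_COUNTS = [Counter(doc) for doc in DOCS]                 # per-document term counts
COLLECTION_FREQ = Counter(tok for doc in DOCS for tok in doc)  # occurrences in whole corpus
DOC_FREQ = Counter(tok for counts in TERM_COUNTS for tok in counts)  # documents containing term
N_TOKENS = sum(len(doc) for doc in DOCS)
N_DOCS = len(DOCS)


def fisher_upper_tail(k, n_doc, n_term, n_total):
    """One-tailed Fisher's exact test p-value P(X >= k), where X is
    hypergeometric: n_doc tokens drawn from n_total, n_term of which are the term."""
    denom = math.comb(n_total, n_doc)
    tail = sum(
        math.comb(n_term, x) * math.comb(n_total - n_term, n_doc - x)
        for x in range(k, min(n_doc, n_term) + 1)
    )
    return tail / denom


def scores(term, j):
    """Return the assumed TF-IDF, assumed TF-ICF, and -log p for a term in document j."""
    tf = TERM_COUNTS[j][term]
    p = fisher_upper_tail(tf, len(DOCS[j]), COLLECTION_FREQ[term], N_TOKENS)
    return {
        "tf_idf": tf * math.log(N_DOCS / DOC_FREQ[term]),
        "tf_icf": tf * math.log(N_TOKENS / COLLECTION_FREQ[term]),
        "neg_log_p": -math.log(p),
    }


# Compare a term concentrated in one document with a term spread across several.
for term, j in [("fisher", 0), ("the", 2)]:
    print(term, j, scores(term, j))
```

In this toy setting the concentrated term scores higher than the widely spread one on all three measures. A corpus this small only illustrates the qualitative agreement; it cannot demonstrate the large-collection behavior described above.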