A Fisher's exact test justification of the TF-IDF term-weighting scheme

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
TF-IDF, a cornerstone heuristic in information retrieval, has long lacked a formal statistical foundation for assessing term significance. Method: Modeling term–document co-occurrence as binomial sampling, the authors show that TF-IDF asymptotically approximates the negative logarithm of the Fisher exact test p-value under large-sample conditions. They further show that the common variant TF-ICF (term frequency–inverse corpus frequency) is, under mild assumptions, closely related to the Fisher p-value on the negative-log scale, and that it converges to classical TF-IDF in the infinite-document limit. Contribution/Results: This work establishes, for the first time, a formal statistical link between TF-IDF, the most influential term-weighting heuristic, and classical hypothesis testing. It endows TF-IDF with an interpretable statistical semantics: the weight quantifies how significantly a term's observed distribution deviates from a random one. Moreover, it provides a principled theoretical framework for designing novel term-weighting schemes grounded in statistical inference, thereby bridging empirical IR practice with foundational statistical theory.
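For concreteness, the two weighting schemes discussed in the summary can be written out in a few lines. This is a generic sketch using raw counts and natural logarithms; the paper's precise normalizations may differ, and the function and variable names here are our own.

```python
import math

def tf_idf(tf, df, n_docs):
    """Classical TF-IDF: term frequency in the document times the log of
    the inverse document frequency (fraction of documents containing the term)."""
    return tf * math.log(n_docs / df)

def tf_icf(tf, cf, n_tokens):
    """TF-ICF variant: document frequency is replaced by corpus (collection)
    frequency, i.e. the term's total occurrence count across all tokens."""
    return tf * math.log(n_tokens / cf)

# Illustrative counts (not from the paper): a term appearing 5 times in a
# document, in 50 of 1,000 documents, and 80 times among 100,000 corpus tokens.
print(tf_idf(5, 50, 1_000))     # 5 * ln(20)
print(tf_icf(5, 80, 100_000))   # 5 * ln(1250)
```

The only structural difference is the denominator of the log term, which is what allows the corpus-frequency variant to be analyzed against token-level sampling models.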

📝 Abstract
Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's occurrences are concentrated in any one given document out of many, TF-IDF and its many variants are routinely used as term-weighting schemes in diverse text analysis applications. There is a growing body of scholarship dedicated to placing TF-IDF on a sound theoretical foundation. Building on that tradition, this paper justifies the use of TF-IDF to the statistics community by demonstrating how the famed expression can be understood from a significance testing perspective. We show that the common TF-IDF variant TF-ICF is, under mild regularity conditions, closely related to the negative logarithm of the $p$-value from a one-tailed version of Fisher's exact test of statistical significance. As a corollary, we establish a connection between TF-IDF and the said negative log-transformed $p$-value under certain idealized assumptions. We further demonstrate, as a limiting case, that this same quantity converges to TF-IDF in the limit of an infinitely large document collection. The Fisher's exact test justification of TF-IDF equips the working statistician with a ready explanation of the term-weighting scheme's long-established effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Justify TF-IDF using Fisher's exact test perspective
Relate TF-ICF to negative log p-value statistically
Connect TF-IDF to significance testing in large collections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Links TF-IDF to Fisher's exact test
Demonstrates TF-ICF as negative log p-value
Shows convergence to TF-IDF in large collections
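The claimed link to significance testing can be probed numerically. The sketch below computes a one-tailed Fisher (upper-tail hypergeometric) p-value from a 2×2 split of in-document versus rest-of-corpus token counts, and compares its negative log to the TF-ICF weight. The table construction, the natural-log base, and the example counts are our assumptions for illustration, not the paper's exact setup.

```python
import math

def fisher_one_tailed_p(tf, doc_len, cf, corpus_len):
    """P(X >= tf) for X hypergeometric: the probability of drawing at least
    tf occurrences of the term into a doc_len-token document sampled from a
    corpus_len-token corpus containing cf occurrences of the term."""
    denom = math.comb(corpus_len, doc_len)
    return sum(
        math.comb(cf, i) * math.comb(corpus_len - cf, doc_len - i)
        for i in range(tf, min(doc_len, cf) + 1)
    ) / denom

def tf_icf(tf, cf, corpus_len):
    # TF-ICF weight with a natural-log ICF (base choice is our assumption)
    return tf * math.log(corpus_len / cf)

# Illustrative term: 5 of 100 tokens in the document, 50 of 100,000 corpus-wide.
p = fisher_one_tailed_p(5, 100, 50, 100_000)
print(f"-ln(p) = {-math.log(p):.2f}")
print(f"TF-ICF = {tf_icf(5, 50, 100_000):.2f}")
```

Both quantities grow as the term concentrates in the document: a rarer or more concentrated term drives the p-value down and the weight up, which is the qualitative relationship the paper formalizes.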
Paul Sheridan
Assistant Professor, University of Prince Edward Island
complex networks · knowledge representation · language AI · multi-omics analysis · statistical modeling
Zeyad Ahmed
Student, University of Prince Edward Island
computational text analysis · machine learning · computational genomics
Aitazaz A. Farooque
Canadian Centre for Climate Change and Adaptation, University of Prince Edward Island, St Peters Bay, PE, Canada; Faculty of Sustainable Design Engineering, University of Prince Edward Island, Charlottetown, PE, Canada