🤖 AI Summary
Existing prediction-algorithm validation methods—such as k-fold cross-validation and public challenge frameworks—assess generalization only within the population that generated the data, neglecting the dual uncertainties arising from both the data-generating population and the target prediction population. Consequently, they lack theoretical grounding for distributional shift and cross-population extrapolation. Method: We propose a comprehensive out-of-sample (OOS) evaluation framework grounded in statistical decision theory (SDT), the first to systematically integrate SDT into ML prediction validation. It explicitly models three sources of randomness—the training sample, the data-generating population, and the target prediction population—and supports conditional prediction (e.g., clinical treatment selection). Contribution/Results: By unifying frequentist inference with cross-population extrapolation modeling, the framework enables formal, interpretable assessment of generalization capacity. It substantially enhances model credibility and decision robustness under distributional shift, advancing the evaluation paradigm from empirical validation toward theory-driven, statistically principled assessment.
📝 Abstract
We argue that comprehensive out-of-sample (OOS) evaluation using statistical decision theory (SDT) should replace the current practice of K-fold and Common Task Framework validation in machine learning (ML) research on prediction. SDT provides a formal frequentist framework for performing comprehensive OOS evaluation across all possible (1) training samples, (2) populations that may generate training data, and (3) populations of prediction interest. Regarding feature (3), we emphasize that SDT requires the practitioner to directly confront the possibility that the future may not look like the past and to account for a possible need to extrapolate from one population to another when building a predictive algorithm. For specificity, we consider treatment choice using conditional predictions with alternative restrictions on the state space of possible populations that may generate training data. We discuss application of SDT to the problem of predicting patient illness to inform clinical decision making. SDT is simple in abstraction, but it is often computationally demanding to implement. We call on ML researchers, econometricians, and statisticians to expand the domain within which implementation of SDT is tractable.
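The gap the abstract points to—cross-validation measuring generalization only within the sampled population—can be seen in a toy simulation. The sketch below is not from the paper; the data-generating process, the shifted target population, and the cubic-polynomial predictor are all invented for illustration. It compares a model's k-fold cross-validation error on the training population with its error on a shifted target population, where the future does not look like the past.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: X ~ N(mean, 1), Y = sin(X) + noise.
def sample(n, mean):
    x = rng.normal(mean, 1.0, n)
    y = np.sin(x) + rng.normal(0.0, 0.1, n)
    return x, y

# Training sample drawn from one population (mean 0).
x_tr, y_tr = sample(500, 0.0)

# A cubic polynomial stands in for an arbitrary ML predictor.
coef = np.polyfit(x_tr, y_tr, 3)
predict = lambda x: np.polyval(coef, x)

def kfold_mse(x, y, k=5):
    """5-fold CV MSE: generalization within the data-generating population."""
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False
        c = np.polyfit(x[mask], y[mask], 3)
        errs.append(np.mean((np.polyval(c, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

cv_mse = kfold_mse(x_tr, y_tr)

# OOS evaluation on a shifted target prediction population (mean 2.5):
# the predictor must extrapolate beyond the region the training data covered.
x_te, y_te = sample(5000, 2.5)
oos_mse = float(np.mean((predict(x_te) - y_te) ** 2))

print(f"5-fold CV MSE (same population): {cv_mse:.3f}")
print(f"OOS MSE (shifted population):    {oos_mse:.3f}")
```

Under this setup the cross-validated error stays small while the error on the shifted population is far larger, illustrating why validation confined to the training population cannot certify performance on a different target population—the situation SDT's third source of randomness is meant to confront.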