RandALO: Out-of-sample risk estimation in no time flat

πŸ“… 2024-09-15
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 3
✨ Influential: 0
πŸ€– AI Summary
Estimating out-of-sample risk for large-scale, high-dimensional models faces a fundamental trade-off between bias and computational cost: $K$-fold cross-validation (CV) suffers from substantial bias, while leave-one-out CV (LOO-CV) is computationally intractable. This paper introduces RandALO (Randomized Approximate Leave-One-Out), a consistent, computationally efficient randomized estimator of the LOO-CV risk. RandALO combines random projections, perturbation analysis of linear models, asymptotic statistical inference, and low-rank matrix approximation to achieve consistency in high dimensions at a computational cost below that of $K$-fold CV. Empirically, RandALO matches LOO-CV's accuracy while running faster than $K$-fold CV on both synthetic and real-world datasets. An open-source Python package, `randalo`, is publicly available on PyPI and GitHub.
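The bias/cost trade-off above hinges on a classical identity: for penalized linear smoothers such as ridge regression, the leave-one-out residual can be recovered from a single fit via the diagonal of the hat matrix, with no refitting at all. A minimal NumPy sketch of that identity (illustrative only, not the `randalo` package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Ridge hat matrix: H = X (X^T X + lam I)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y

# Exact LOO residuals from one fit: r_i / (1 - H_ii)
loo_shortcut = resid / (1.0 - np.diag(H))

# Brute-force LOO for comparison: refit n times
loo_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    loo_brute[i] = y[i] - X[i] @ beta

print(np.allclose(loo_shortcut, loo_brute))  # prints True
```

For ridge regression the shortcut is exact (a Sherman–Morrison consequence); ALO-style estimators extend this idea to general losses and regularizers, which is the setting RandALO accelerates.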

πŸ“ Abstract
Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
Problem

Research questions and friction points this paper is trying to address.

Estimating out-of-sample risk efficiently for large datasets
Addressing high bias and computational cost in cross-validation
Providing a consistent and faster alternative to K-fold CV
Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomized approximate leave-one-out risk estimator
Consistent risk estimation in high dimensions
More efficient than K-fold cross-validation
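Forming the full hat matrix is precisely what becomes intractable at scale; one standard randomized workaround, in the spirit of the random-projection ingredients listed above (though not the paper's exact algorithm), estimates only its diagonal from a handful of matrix-vector products, Hutchinson style. A hedged sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, m = 50, 10, 1.0, 5000
X = rng.standard_normal((n, p))

# Apply H = X (X^T X + lam I)^{-1} X^T via solves, never forming H explicitly
A = X.T @ X + lam * np.eye(p)

# Hutchinson-style diagonal estimator with Rademacher probes:
# diag(H) ~= mean_k  z_k * (H z_k)   (elementwise product)
Z = rng.choice([-1.0, 1.0], size=(n, m))
d_est = np.mean(Z * (X @ np.linalg.solve(A, X.T @ Z)), axis=1)

# Exact diagonal, for checking the approximation on this small problem
d_true = np.diag(X @ np.linalg.solve(A, X.T))
print(np.max(np.abs(d_est - d_true)))  # small estimation error
```

Each probe costs one solve against a p-by-p system rather than an n-by-n inverse, which is the kind of savings that lets a randomized ALO estimator undercut $K$-fold CV's cost of refitting the model $K$ times.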
Authors

Parth Nobel, Department of Electrical Engineering, Stanford University
Daniel LeJeune, Department of Statistics, Stanford University
E. Candès, Department of Statistics, Stanford University