Prediction-Augmented Trees for Reliable Statistical Inference

📅 2025-10-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses how to safely integrate machine learning (ML) predictions with statistical inference in scientific discovery. We propose two learning-augmented estimators: PART, a decision-tree-based estimator, and its limiting form, PAQ. Leveraging a small number $n$ of high-accuracy labeled samples and a much larger number $N$ of unlabeled samples, our method greedily constructs prediction-augmented residual trees that tightly couple ML predictions with gold-standard labels. We rigorously derive the estimators' asymptotic distributions and construct valid confidence intervals. Theoretically, PAQ achieves a variance convergence rate of $O(N^{-1} + n^{-4})$, substantially improving on the $O(N^{-1} + n^{-1})$ rate of existing methods. Empirical evaluation on real-world datasets from ecology, astronomy, and census surveys demonstrates that PART and PAQ significantly outperform baselines, including PPI, in both estimation accuracy and confidence-interval coverage. Our work establishes a new paradigm for trustworthy, ML-augmented statistical inference.
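For orientation, below is a minimal, self-contained Python sketch of the PPI baseline estimator of Angelopoulos et al. (2023) that PART and PAQ are compared against, applied to mean estimation. The data-generating process and the `noisy_model` predictor are synthetic stand-ins of our own, not taken from the paper.

```python
# Minimal sketch of the PPI baseline (Angelopoulos et al., 2023) for mean
# estimation. The data and `noisy_model` are hypothetical stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: n gold-standard labeled pairs, N unlabeled covariates.
n, N = 100, 10_000
X_lab = rng.normal(size=n)
Y_lab = 2.0 * X_lab + rng.normal(scale=0.5, size=n)   # gold-standard labels
X_unlab = rng.normal(size=N)

def noisy_model(x):
    """Stand-in ML predictor: an imperfect, slightly biased imputation."""
    return 1.8 * x + 0.1

# Prediction-powered estimate of E[Y]: the mean prediction on the unlabeled
# data, plus a residual ("rectifier") correction estimated from labeled data.
pred_unlab = noisy_model(X_unlab)
residuals = Y_lab - noisy_model(X_lab)
theta_pp = pred_unlab.mean() + residuals.mean()

# Asymptotic variance and a 95% confidence interval via normal approximation.
var_pp = pred_unlab.var(ddof=1) / N + residuals.var(ddof=1) / n
half_width = stats.norm.ppf(0.975) * np.sqrt(var_pp)
print(f"PPI estimate: {theta_pp:.3f} +/- {half_width:.3f}")
```

The residual term estimated from the $n$ labeled samples drives the $n^{-1}$ part of the variance; that is the term PART and PAQ aim to shrink.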

📝 Abstract
The remarkable success of machine learning (ML) in predictive tasks has led scientists to incorporate ML predictions as a core component of the scientific discovery pipeline, as exemplified by the landmark achievement of AlphaFold (Jumper et al., 2021). In this paper, we study how ML predictions can be safely used in the statistical analysis of data towards scientific discovery. In particular, we follow the framework introduced by Angelopoulos et al. (2023). In this framework, we assume access to a small set of $n$ gold-standard labeled samples, a much larger set of $N$ unlabeled samples, and an ML model that can be used to impute the labels of the unlabeled data points. We introduce two new learning-augmented estimators: (1) the Prediction-Augmented Residual Tree (PART), and (2) Prediction-Augmented Quadrature (PAQ). Both estimators have significant advantages over existing estimators such as PPI and PPI++, introduced by Angelopoulos et al. (2023) and Angelopoulos et al. (2024), respectively. PART is a decision-tree-based estimator built using a greedy criterion. We first characterize PART's asymptotic distribution and demonstrate how to construct valid confidence intervals. We then show that PART outperforms existing methods on real-world datasets from ecology, astronomy, and census reports, among other domains; this yields estimators with tighter confidence intervals, a consequence of using both the gold-standard samples and the machine learning predictions. Finally, we provide a formal proof of the advantage of PART by studying PAQ, an estimator that arises as the limit of PART when the depth of its tree grows to infinity. Under appropriate assumptions on the input data, we show that the variance of PAQ shrinks at a rate of $O(N^{-1} + n^{-4})$, improving significantly on the $O(N^{-1}+n^{-1})$ rate of existing methods.
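To get a feel for the rate improvement, consider the labeled-data term of the variance at an illustrative sample size of $n = 100$ (the choice of $n$ and the arithmetic below are our own illustration, not the paper's):

```latex
% Illustrative arithmetic at n = 100 (our choice, not the paper's):
% the labeled-data term of the variance under each rate.
\[
  \underbrace{n^{-1} = 100^{-1} = 10^{-2}}_{\text{PPI / PPI++}}
  \qquad \text{vs.} \qquad
  \underbrace{n^{-4} = 100^{-4} = 10^{-8}}_{\text{PAQ}}
\]
```

Both families of methods share the $O(N^{-1})$ unlabeled-data term; the gain comes entirely from how efficiently the $n$ gold-standard labels are used.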
Problem

Research questions and friction points this paper is trying to address.

Safely incorporating ML predictions into statistical analysis for science
Developing reliable estimators using limited labeled and abundant unlabeled data
Improving confidence intervals and variance rates over existing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prediction-Augmented Residual Tree for statistical inference (see the sketch after this list)
Prediction-Augmented Quadrature with improved variance rate
Decision-tree estimator using gold-standard and ML predictions
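The following is a hypothetical, PART-flavored sketch of the leaf-wise residual-correction idea. It uses an off-the-shelf CART regressor rather than the paper's greedy splitting criterion, and it omits the paper's confidence-interval construction; every name and number in it is our own illustration.

```python
# Hypothetical PART-flavored sketch: a shallow regression tree (sklearn's
# DecisionTreeRegressor, NOT the paper's greedy criterion) learns a leaf-wise
# correction to the ML predictions from the n labeled samples; the corrected
# predictions are then averaged over the N unlabeled samples.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, N = 100, 10_000
X_lab = rng.normal(size=(n, 1))
Y_lab = 2.0 * X_lab[:, 0] + rng.normal(scale=0.5, size=n)
X_unlab = rng.normal(size=(N, 1))

def noisy_model(x):
    """Stand-in ML predictor (hypothetical)."""
    return 1.8 * x[:, 0] + 0.1

# Fit a shallow tree to the residuals Y - f(X) on the labeled data, so each
# leaf stores the mean residual of the labeled points that fall into it.
residuals = Y_lab - noisy_model(X_lab)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_lab, residuals)

# Leaf-wise corrected mean over the unlabeled data.
theta_tree = (noisy_model(X_unlab) + tree.predict(X_unlab)).mean()
print(f"Tree-corrected estimate: {theta_tree:.3f}")
```

Intuitively, deepening the tree refines the residual correction region by region; PAQ is the paper's formalization of the infinite-depth limit of this process.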
Vikram Kher
Department of Computer Science, Yale University

Argyris Oikonomou
Department of Computer Science, Yale University

Manolis Zampetakis
Yale University