Spatially Robust Inference with Predicted and Missing at Random Labels

📅 2026-03-11
🤖 AI Summary
This study addresses uncertainty quantification for statistical inference that relies on machine learning predictions when labels are missing at random (MAR) and the data are spatially dependent. The authors propose a doubly robust estimator with cross-fit nuisance estimates, together with a jackknife spatial heteroscedasticity and autocorrelation consistent (HAC) variance correction that separates spatial dependence from the fold-level correlation induced by cross-fitting. Under standard identification and dependence conditions, the resulting confidence intervals are asymptotically valid. Simulations and benchmark datasets show substantially improved finite-sample calibration, particularly under MAR labeling and clustered sampling.
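To make the idea of a cross-fit doubly robust estimator concrete, here is a minimal sketch for estimating a population mean under MAR labels. It is an illustration under stated assumptions, not the authors' implementation: the function name `crossfit_aipw_mean`, the linear outcome model, the clipped linear probability model for the labeling probability, and the clipping threshold of 0.05 are all hypothetical choices made for brevity.

```python
import numpy as np

def crossfit_aipw_mean(X, y, labeled, f_pred, n_folds=5, seed=0):
    """Cross-fit doubly robust (AIPW) estimate of E[Y] under MAR labels.

    X: (n, d) covariates; y: (n,) outcomes, ignored where labeled == 0;
    labeled: (n,) 0/1 indicators; f_pred: (n,) ML predictions.
    Returns (estimate, psi), where psi holds the per-unit scores.
    """
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    labeled = np.asarray(labeled).astype(bool)
    y_obs = np.where(labeled, y, 0.0)  # unlabeled outcomes never enter a fit
    ones = np.ones(n)
    A_out = np.column_stack([ones, f_pred, X])  # outcome-model features
    A_lab = np.column_stack([ones, X])          # labeling-model features

    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds        # random, near-equal folds
    psi = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        fit = train & labeled
        # Outcome nuisance m(x): least squares on labeled training units only.
        beta = np.linalg.lstsq(A_out[fit], y_obs[fit], rcond=None)[0]
        m = A_out[test] @ beta
        # Labeling nuisance pi(x): linear probability model (an assumption),
        # clipped away from zero so the inverse weights stay bounded.
        gamma = np.linalg.lstsq(
            A_lab[train], labeled[train].astype(float), rcond=None
        )[0]
        pi = np.clip(A_lab[test] @ gamma, 0.05, 1.0)
        # AIPW score: model prediction plus inverse-probability-weighted residual.
        psi[test] = m + labeled[test] / pi * (y_obs[test] - m)
    return psi.mean(), psi
```

Each unit's score is computed from nuisance models fit on the other folds, which is what introduces the fold-level correlation the summary refers to. The returned scores `psi` are the natural input to a variance estimator.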

📝 Abstract
When outcome data are expensive or onerous to collect, scientists increasingly substitute predictions from machine learning and AI models for unlabeled cases, a practice that has consequences for downstream statistical inference. While recent methods provide valid uncertainty quantification under independent sampling, real-world applications often involve missing at random (MAR) labeling and spatial dependence. For inference in this setting, we propose a doubly robust estimator with cross-fit nuisances. We show that cross-fitting induces fold-level correlation that distorts spatial variance estimators, producing unstable or overly conservative confidence intervals. To address this, we propose a jackknife spatial heteroscedasticity and autocorrelation consistent (HAC) variance correction that separates spatial dependence from fold-induced noise. Under standard identification and dependence conditions, the resulting intervals are asymptotically valid. Simulations and benchmark datasets show substantial improvement in finite-sample calibration, particularly under MAR labeling and clustered sampling.
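For orientation, the sketch below shows a plain spatial HAC standard error for the mean of per-unit scores, using a Bartlett (triangular) kernel in Euclidean distance. This is the kind of baseline spatial variance estimator the abstract says is distorted by cross-fitting; it does not reproduce the paper's jackknife correction, whose details are not given here. The function name `spatial_hac_se`, the kernel choice, and the `bandwidth` parameter are assumptions made for illustration.

```python
import numpy as np

def spatial_hac_se(psi, coords, bandwidth):
    """Spatial HAC standard error for the sample mean of psi.

    psi: (n,) per-unit scores; coords: (n, 2) locations;
    bandwidth: distance beyond which pairs receive zero weight.
    """
    psi = np.asarray(psi, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(psi)
    u = psi - psi.mean()                            # centered scores
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise distances
    w = np.clip(1.0 - dist / bandwidth, 0.0, None)  # Bartlett kernel weights
    var_mean = (u @ w @ u) / n ** 2                 # kernel-weighted covariance sum
    # A distance-based triangular kernel need not be positive semidefinite
    # in two dimensions, so floor the variance at zero before the square root.
    return np.sqrt(max(var_mean, 0.0))
```

When the bandwidth is smaller than the closest pair of locations, only the diagonal terms survive and the result reduces to the usual independent-sampling standard error. Larger bandwidths add cross-products between nearby units, which widens the intervals under positive spatial correlation.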
Problem

Research questions and friction points this paper is trying to address.

missing at random
spatial dependence
statistical inference
machine learning predictions
uncertainty quantification
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial inference
missing at random
cross-fitting
doubly robust estimator
jackknife HAC