kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

📅 2025-09-10
🤖 AI Summary
Many imputation methods estimate only the conditional mean of a missing value and therefore do not characterize predictive uncertainty. To address this, we propose kNNSampler, a k-nearest-neighbor-based stochastic imputation method that estimates the full conditional distribution of a missing response given the observed covariates. Its core idea is to sample at random from the observed responses of the k most similar units, which recovers the distribution of the missing values and supports uncertainty quantification and multiple imputation. We establish theoretical guarantees: under mild regularity conditions, kNNSampler achieves asymptotic consistency in estimating the conditional distribution. Empirical evaluations across diverse missingness mechanisms (MCAR, MAR, MNAR) and data types demonstrate substantial improvements over state-of-the-art mean-based and model-driven imputation approaches. An open-source implementation is publicly available.
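The contrast between estimating a conditional mean and estimating a conditional distribution can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the paper: $(x_i, y_i)$ are the units with observed responses, $N_k(x)$ is the index set of the $k$ observed units whose covariates are closest to $x$, and $\delta_{y}$ is a point mass at $y$.

```latex
\hat{m}_k(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i
\qquad \text{(kNN estimate of the conditional mean)}
\\[6pt]
\hat{P}_k(\,\cdot \mid x) = \frac{1}{k} \sum_{i \in N_k(x)} \delta_{y_i}
\qquad \text{(empirical distribution of the $k$ neighbors' responses)}
```

Drawing one neighbor's response uniformly at random is the same as sampling from $\hat{P}_k(\cdot \mid x)$, and the mean of $\hat{P}_k(\cdot \mid x)$ is exactly $\hat{m}_k(x)$. A mean-based imputer therefore keeps only one summary of the distribution that the sampler draws from.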

📝 Abstract
We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ units most similar to it in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments demonstrate its effectiveness in recovering the distribution of missing values. The code for kNNSampler is publicly available (https://github.com/SAP/knn-sampler).
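The sampling step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the linked repository: the function name `knn_sample_impute`, its parameters, and the choice of Euclidean distance as the similarity measure are all assumptions made here.

```python
import numpy as np


def knn_sample_impute(X_obs, y_obs, X_miss, k=5, rng=None):
    """Impute each missing response with one observed response drawn
    uniformly at random from the k nearest observed units.

    Hypothetical helper for illustration only.
    X_obs : (n, d) covariates of units whose response is observed
    y_obs : (n,)   observed responses
    X_miss: (m, d) covariates of units whose response is missing
    """
    rng = np.random.default_rng(rng)
    X_obs = np.asarray(X_obs, dtype=float)
    y_obs = np.asarray(y_obs)
    X_miss = np.asarray(X_miss, dtype=float)
    if not 1 <= k <= len(X_obs):
        raise ValueError("k must be between 1 and the number of observed units")

    # Squared Euclidean distance between every missing and observed unit
    # (assumed similarity measure).
    diffs = X_miss[:, None, :] - X_obs[None, :, :]
    dists = np.einsum("mnd,mnd->mn", diffs, diffs)

    # Indices of the k nearest observed units for each missing unit.
    neighbors = np.argpartition(dists, k - 1, axis=1)[:, :k]

    # Choose one of the k neighbors per missing unit, uniformly at random.
    picks = rng.integers(0, k, size=len(X_miss))
    return y_obs[neighbors[np.arange(len(X_miss)), picks]]


# Usage on synthetic data: the imputed values are actual observed responses,
# so their spread reflects the noise around the regression line.
demo_rng = np.random.default_rng(0)
X = demo_rng.normal(size=(300, 1))
y = X[:, 0] + demo_rng.normal(scale=0.5, size=300)
y_imputed = knn_sample_impute(X[:200], y[:200], X[200:], k=10, rng=1)
print(y_imputed.shape)
```

Replacing the random pick with an average over the `k` neighbors would give a mean-style imputation, which collapses the neighbors' spread into a single value.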
Problem

Research questions and friction points this paper is trying to address.

Estimating conditional distributions of missing responses
Recovering unknown missing value distributions
Quantifying uncertainties in missing data imputation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Random sampling from k-nearest neighbors
Estimating conditional distributions of missing values
Quantifying uncertainties in multiple imputation
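Repeating the random draw several times is one way to turn the items above into uncertainty estimates. The sketch below assumes Euclidean distance and uses a hypothetical helper, `knn_multiple_impute`, that is not part of the published code; it returns one row per imputation so that per-unit intervals and the spread of a downstream estimate can be read off.

```python
import numpy as np


def knn_multiple_impute(X_obs, y_obs, X_miss, k=5, n_draws=100, seed=None):
    """Return an (n_draws, m) array; each row is one stochastic imputation
    of the m missing responses. Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    X_obs = np.asarray(X_obs, dtype=float)
    X_miss = np.asarray(X_miss, dtype=float)
    y_obs = np.asarray(y_obs)
    if not 1 <= k <= len(X_obs):
        raise ValueError("k must be between 1 and the number of observed units")

    # Squared Euclidean distances (assumed metric), shape (m, n).
    dists = ((X_miss[:, None, :] - X_obs[None, :, :]) ** 2).sum(axis=2)
    neighbors = np.argsort(dists, axis=1)[:, :k]              # (m, k)

    # Independent uniform neighbor choice for every draw and every unit.
    picks = rng.integers(0, k, size=(n_draws, len(X_miss)))   # (n_draws, m)
    return y_obs[neighbors[np.arange(len(X_miss)), picks]]


# Synthetic data whose noise level grows with x.
demo_rng = np.random.default_rng(0)
x = demo_rng.uniform(0.0, 1.0, size=500)
y = x + demo_rng.normal(scale=0.05 + x, size=500)
X_obs, y_obs, X_miss = x[:400, None], y[:400], x[400:, None]

draws = knn_multiple_impute(X_obs, y_obs, X_miss, k=20, n_draws=200, seed=1)

# Per-unit 5%-95% interval across the imputations, shape (2, m).
intervals = np.quantile(draws, [0.05, 0.95], axis=0)

# Mean of the response in each completed dataset, and its spread across draws.
completed_means = (y_obs.sum() + draws.sum(axis=1)) / (len(y_obs) + draws.shape[1])
print(completed_means.mean(), completed_means.std())
```

The standard deviation printed at the end captures only the variation between imputations; it is a rough indicator of the uncertainty due to missingness, not a complete variance estimate.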
Parastoo Pashmchi
SAP Labs France E-Mobility Research, EURECOM, Sophia Antipolis, France
Jerome Benoit
SAP Labs France E-Mobility Research
Motonobu Kanagawa
EURECOM
statistics, machine learning, applied mathematics, simulation, probabilistic numerics