🤖 AI Summary
Uncertainty quantification in missing data imputation is often overlooked, and the relationship between calibration quality and imputation accuracy remains poorly understood. This paper presents the first systematic empirical evaluation of six state-of-the-art imputation methods—statistical (MICE, SoftImpute), distribution-alignment (OT-Impute), and deep generative (GAIN, MIWAE, TabCSDI)—across multiple real-world datasets, under MCAR, MAR, and MNAR missingness mechanisms, and across varying missingness rates. We propose a multi-path evaluation framework integrating repeated-sampling variability, conditional distribution modeling, and predictive confidence quantification to rigorously assess uncertainty calibration. Results reveal that high imputation accuracy does not imply well-calibrated uncertainty estimates; significant trade-offs exist among accuracy, calibration fidelity, and computational efficiency across method categories. We identify several robust, reproducible configurations, providing actionable, evidence-based guidance for model selection in downstream machine learning and data cleaning tasks.
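The repeated-sampling path of the framework can be illustrated with a minimal sketch: run a stochastic imputer several times and treat the per-cell spread across runs as an uncertainty estimate. The `impute_fn` interface below is hypothetical, standing in for any of the stochastic methods mentioned above, not the paper's actual API.

```python
import numpy as np

def multirun_uncertainty(impute_fn, X_missing, n_runs=20, seed=0):
    """Per-cell uncertainty from repeated stochastic imputation runs.

    impute_fn : callable (X_with_nans, rng) -> fully imputed array;
                a hypothetical stand-in for any stochastic imputer
                (e.g. GAIN- or MICE-style methods).
    X_missing : array with np.nan marking missing cells.
    Returns (mean imputation, per-cell std across runs).
    """
    rng = np.random.default_rng(seed)
    runs = np.stack([impute_fn(X_missing, rng) for _ in range(n_runs)])
    # Std is zero on observed cells (imputers leave them intact) and
    # reflects run-to-run variability on the missing cells.
    return runs.mean(axis=0), runs.std(axis=0)

def toy_impute(X, rng):
    """Toy imputer for demonstration: fills NaNs with standard normals."""
    Y = X.copy()
    mask = np.isnan(Y)
    Y[mask] = rng.normal(size=mask.sum())
    return Y
```

Under this interface, observed cells get zero spread while missing cells inherit the imputer's sampling variability, which is the quantity the calibration analysis then evaluates.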
📝 Abstract
Handling missing data is a central challenge in data-driven analysis. Modern imputation methods not only aim for accurate reconstruction but also differ in how they represent and quantify uncertainty. Yet the reliability and calibration of these uncertainty estimates remain poorly understood. This paper presents a systematic empirical study of uncertainty in imputation, comparing representative methods from three major families: statistical (MICE, SoftImpute), distribution-alignment (OT-Impute), and deep generative (GAIN, MIWAE, TabCSDI). Experiments span multiple datasets, missingness mechanisms (MCAR, MAR, MNAR), and missingness rates. Uncertainty is estimated through three complementary routes (multi-run variability, conditional sampling, and predictive-distribution modeling) and evaluated using calibration curves and the Expected Calibration Error (ECE). Results show that accuracy and calibration are often misaligned: models with high reconstruction accuracy do not necessarily yield reliable uncertainty. We analyze method-specific trade-offs among accuracy, calibration, and runtime, identify stable configurations, and offer guidelines for selecting uncertainty-aware imputers in data cleaning and downstream machine learning pipelines.