Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

career value

164K/year

🤖 AI Summary

This work addresses a critical limitation of conventional missing value imputation methods, which typically assume data are missing completely at random and thereby overlook the presence of "informative missingness"—where the absence of data itself conveys meaningful semantic information. To tackle this issue, the authors propose Diff-Joint, a novel framework that jointly models informative and imputable missingness for the first time. Leveraging a diffusion model, Diff-Joint simultaneously learns the underlying data distribution and the latent missingness mask, while incorporating an uncertainty-aware mechanism to guide selective imputation. Through conditional sampling and uncertainty-aware aggregation, the method accurately distinguishes informative missing entries on both synthetic and real-world datasets, achieves competitive imputation accuracy, and significantly enhances performance in downstream tasks such as classification and regression.

📝 Abstract

Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.

Problem

Research questions and friction points this paper is trying to address.

missing value imputation

meaningful missingness

selective imputation

uncertainty-aware

tabular data

Innovation

Methods, ideas, or system contributions that make the work stand out.

selective imputation

uncertainty-aware diffusion

meaningful missingness