AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

📅 2026-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models struggle on tabular data with missing values because their backbones assume fully observed inputs. This work proposes AugMask, a training framework that treats missing entries as uncertain conditioning context rather than reconstruction targets: inputs are completed by conditional stochastic augmentation from a lightweight auxiliary model, and denoising supervision is applied only to observed values. The key idea is to decouple conditioning from supervision, which connects the training rule to a Rao–Blackwellized objective and yields a variance-weighted sensitivity penalty that discourages reliance on uncertain completions. Across diverse datasets and missingness mechanisms, AugMask enables standard diffusion-based tabular generators to outperform specialized missingness-aware baselines.
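To give some intuition for where a variance-weighted sensitivity penalty can come from, the following is a generic first-order noise-injection argument, not the paper's own derivation. All symbols are assumptions introduced here for illustration: $f_\theta$ is the denoiser, $x_o$ the observed coordinates, $\tilde{x}_m = \mu + \varepsilon$ the augmented missing coordinates with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Cov}(\varepsilon) = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$, $y$ the denoising target on observed coordinates, and $J$ the Jacobian of $f_\theta$ with respect to the missing coordinates, evaluated at $\mu$.

```latex
\begin{align*}
f_\theta(x_o, \mu + \varepsilon)
  &\approx f_\theta(x_o, \mu) + J\varepsilon, \\
\mathbb{E}_{\varepsilon}\,\bigl\| f_\theta(x_o, \mu + \varepsilon) - y \bigr\|^2
  &\approx \bigl\| f_\theta(x_o, \mu) - y \bigr\|^2
   + \sum_{j=1}^{k} \sigma_j^2 \,\bigl\| J_{:,j} \bigr\|^2 .
\end{align*}
```

The cross term vanishes because $\varepsilon$ has zero mean. Under these assumptions, the second term penalizes the sensitivity of the output to each missing coordinate in proportion to that coordinate's variance, so entries with more uncertain completions are discouraged more strongly from influencing the prediction.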
📝 Abstract
Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoising supervision only to observed coordinates. In effect, augmented missing entries serve as uncertain conditioning context rather than training targets. We connect this training rule to a Rao–Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, discouraging over-reliance on uncertain completions. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines.
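The two steps in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `augment_missing` and `masked_denoising_loss` are hypothetical, a per-column Gaussian fitted on observed values stands in for the lightweight auxiliary model (the paper's model is conditional), missing entries are assumed to be encoded as NaN, and a zero prediction stands in for the denoising network.

```python
import numpy as np


def augment_missing(x, observed, rng):
    """Fill missing entries with random draws (stand-in auxiliary model).

    Assumption: each column's missing values are sampled from a Gaussian
    fitted to that column's observed values. Observed entries are untouched.
    """
    x_filled = x.copy()
    for j in range(x.shape[1]):
        obs_j = observed[:, j]
        if obs_j.all():
            continue  # nothing to fill in this column
        col = x[obs_j, j]
        mu = col.mean() if col.size else 0.0
        sigma = col.std() if col.size else 1.0
        n_missing = int((~obs_j).sum())
        x_filled[~obs_j, j] = rng.normal(mu, sigma, size=n_missing)
    return x_filled


def masked_denoising_loss(eps_pred, eps_true, observed):
    """Mean squared noise-prediction error over observed coordinates only."""
    sq_err = (eps_pred - eps_true) ** 2
    mask = observed.astype(sq_err.dtype)
    return float((sq_err * mask).sum() / max(mask.sum(), 1.0))


rng = np.random.default_rng(0)
x = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0]])
observed = ~np.isnan(x)  # True where a value was actually recorded

# Step 1: stochastic augmentation gives the backbone a fully specified input.
x0 = augment_missing(x, observed, rng)

# Standard forward noising at one (assumed) noise level alpha_bar.
eps = rng.normal(size=x0.shape)
alpha_bar = 0.5
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Step 2: supervise only observed coordinates.
eps_pred = np.zeros_like(x_t)  # placeholder for network(x_t, t)
loss = masked_denoising_loss(eps_pred, eps, observed)
```

Because the mask zeroes out the error on augmented coordinates, those entries influence training only through the input `x_t`, which is the sense in which they act as conditioning context rather than targets. Drawing a fresh augmentation at each training step is one plausible way to expose the model to the uncertainty in the completions; whether the paper resamples in this way is an assumption here.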
Problem

Research questions and friction points this paper is trying to address.

diffusion models
tabular data
missing values
incomplete data
generative modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
tabular data
missing values
stochastic augmentation
Rao-Blackwellization