🤖 AI Summary
This work addresses missing value imputation and synthetic data generation for tabular data—critical tasks in high-stakes domains such as healthcare and finance. We propose the first unified diffusion-based framework enabling end-to-end joint optimization for both tasks. Methodologically, we introduce a conditional attention mechanism, an encoder-decoder Transformer denoising network, and a dynamic masking strategy to explicitly capture inter-variable dependencies and adaptively handle diverse missingness patterns. Evaluated on multiple benchmark datasets, our approach achieves state-of-the-art performance across three key dimensions: machine learning utility (e.g., downstream F1 score), statistical fidelity (e.g., Jensen–Shannon divergence), and privacy preservation (e.g., adversarial attack success rate), significantly outperforming VAEs, GANs, and existing diffusion baselines. Our core contribution is the first diffusion paradigm tailored for joint tabular imputation and synthesis—offering superior modeling capacity, interpretability, and practical robustness.
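The conditional attention mechanism described above can be illustrated with a minimal single-head cross-attention sketch, in which tokens of the noisy sample (decoder side) attend over tokens of the condition (encoder side). This is a hypothetical illustration of the general idea, not the paper's exact architecture; the weight matrices `Wq`, `Wk`, `Wv` stand in for learned parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, cond, Wq, Wk, Wv):
    """Single-head cross-attention sketch: each query token (a feature of
    the noisy sample being denoised) attends over all condition tokens.
    Wq, Wk, Wv are hypothetical stand-ins for learned projections."""
    Q = queries @ Wq            # project noisy-sample tokens to queries
    K = cond @ Wk               # project condition tokens to keys
    V = cond @ Wv               # project condition tokens to values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ V      # condition-weighted values
```

In an encoder-decoder denoiser, an output like this would typically be added residually to the decoder's representation of the noisy sample, letting the synthetic features depend explicitly on the condition.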
📝 Abstract
Data imputation and data generation have important applications in many domains, such as healthcare and finance, where incomplete or missing data can hinder accurate analysis and decision-making. Diffusion models have emerged as powerful generative models capable of capturing complex data distributions across modalities such as images, audio, and time series. Recently, they have also been adapted to generate tabular data. In this paper, we propose a diffusion model for tabular data that introduces three key enhancements: (1) a conditioning attention mechanism, (2) an encoder-decoder transformer as the denoising network, and (3) dynamic masking. The conditioning attention mechanism improves the model's ability to capture the relationship between the condition and the synthetic data. The transformer layers model interactions within the condition (encoder) or the synthetic data (decoder), while dynamic masking enables the model to handle both missing-data imputation and synthetic-data generation efficiently within a unified framework. We conduct a comprehensive evaluation comparing diffusion models with transformer conditioning against state-of-the-art techniques, including Variational Autoencoders, Generative Adversarial Networks, and existing diffusion models, on benchmark datasets. Our evaluation assesses the generated samples with respect to three important criteria: (1) machine learning efficiency, (2) statistical similarity, and (3) privacy risk mitigation. For the data imputation task, we consider the efficiency of the generated samples across different levels of missing features.
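The dynamic masking idea mentioned above can be sketched as follows: during training, each row's features are randomly split into a condition part (treated as observed) and a target part (to be denoised), so a single model learns everything from full generation (empty condition) to imputation with few missing values. This is a minimal sketch under assumed conventions (mask = 1 for observed features); the paper's exact sampling scheme may differ.

```python
import numpy as np

def dynamic_mask(x, rng, p_cond=None):
    """Sample a per-row binary mask splitting features of x (n rows, d
    features) into condition (mask = 1, observed) and target (mask = 0,
    to be denoised). If p_cond is None, the per-row conditioning rate is
    drawn uniformly, exposing the model to masks ranging from full
    generation to near-complete imputation. Hypothetical sketch."""
    n, d = x.shape
    if p_cond is None:
        p_cond = rng.uniform(0.0, 1.0, size=(n, 1))  # per-row rate
    mask = (rng.uniform(size=(n, d)) < p_cond).astype(x.dtype)
    x_cond = x * mask            # observed features, fed to the encoder
    x_target = x * (1.0 - mask)  # features the decoder must denoise
    return x_cond, x_target, mask
```

At inference time, imputation would use the dataset's actual missingness pattern as the mask, while unconditional generation would use an all-zero mask, which is what lets one framework serve both tasks.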