🤖 AI Summary
This study addresses two problems in missing data imputation for large-scale surveys: structural skip patterns are often conflated with item nonresponse, and ordinal variables are usually encoded as if they were nominal. The authors propose TabSODA, a diffusion-based imputer that models both skip mechanisms and ordinal semantics. The method combines the Elucidated Diffusion Model (EDM) framework with the EM algorithm and propagates structural skips through the denoising loss and the reverse-time sampler. Ordinal responses are represented by cumulative-probit scalar latents, while nominal variables retain analog-bit encodings. When no codebook skip mask is available, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. Evaluated on two national surveys, PATH and NSDUH, TabSODA reduces ordinal MACE by up to 23.7% and improves categorical accuracy by up to 9% over the strongest baseline, and the skip miner achieves near-perfect precision in skip pattern recovery.
📝 Abstract
Missing data imputation in large-scale surveys faces two challenges that are not well handled by current tabular diffusion methods. First, \emph{structural skips}, cells made inapplicable by questionnaire design, should not be imputed but are often conflated with item nonresponse. Second, \emph{ordinal} responses encode ordered categories, yet most pipelines treat them as nominal levels through one-hot or analog-bit encodings. We introduce \textbf{TabSODA} (\textbf{Tab}ular diffusion with \textbf{S}kip pattern detection and \textbf{O}r\textbf{d}inal \textbf{A}wareness), an Expectation-Maximization (EM)-based diffusion imputer built on the Elucidated Diffusion Model (EDM) framework. TabSODA propagates structural skips through the denoising loss and reverse-time sampler, and represents ordinal variables with cumulative-probit scalar latents while retaining analog-bit encodings for nominal variables. When a codebook skip mask is available, TabSODA uses it directly; otherwise, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. On the Population Assessment of Tobacco and Health (PATH) study and the National Survey on Drug Use and Health (NSDUH), two nationally representative U.S.\ surveys, TabSODA reduces ordinal MACE by up to $23.7\%$ and improves categorical accuracy by up to $9\%$ over the strongest baseline across MCAR, MAR, and MNAR masking. The skip miner achieves near-perfect precision on both datasets, allowing TabSODA+SKIP to closely track the codebook-mask variant.
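To make the idea of a cumulative-probit scalar latent concrete, the following is a minimal sketch and not the authors' implementation. It assumes that thresholds are taken as the standard-normal quantiles of the empirical cumulative frequencies, and that each ordinal level is encoded as the probit of the midpoint of its cumulative-probability interval. The function names `fit_cutpoints`, `encode`, and `decode` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def fit_cutpoints(levels, n_levels):
    """Estimate probit thresholds from observed levels coded 0..n_levels-1.

    Assumption: threshold k is the normal quantile of the empirical
    cumulative frequency P(level <= k).
    """
    counts = Counter(levels)
    total = len(levels)
    cumulative = 0.0
    cutpoints = []
    for k in range(n_levels - 1):
        cumulative += counts.get(k, 0) / total
        # Clip away from 0 and 1 so the quantile stays finite.
        p = min(max(cumulative, 1e-6), 1.0 - 1e-6)
        cutpoints.append(_STD_NORMAL.inv_cdf(p))
    return cutpoints


def encode(level, cutpoints):
    """Map an ordinal level to one scalar latent (illustrative choice:
    the probit of the midpoint of its cumulative-probability interval)."""
    lo = _STD_NORMAL.cdf(cutpoints[level - 1]) if level > 0 else 0.0
    hi = _STD_NORMAL.cdf(cutpoints[level]) if level < len(cutpoints) else 1.0
    return _STD_NORMAL.inv_cdf((lo + hi) / 2.0)


def decode(z, cutpoints):
    """Map a (possibly noisy) scalar latent back to the level whose
    threshold interval contains it."""
    return bisect_right(cutpoints, z)


# Toy responses on a 4-point ordered scale (hypothetical data).
responses = [0, 0, 1, 1, 1, 2, 3, 3]
cuts = fit_cutpoints(responses, n_levels=4)
latents = [encode(k, cuts) for k in range(4)]
recovered = [decode(z, cuts) for z in latents]
```

Unlike one-hot or analog-bit encodings, this representation keeps adjacent categories adjacent on the real line, so a small error in the denoised latent moves the decoded response at most to a neighbouring level.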