BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

📅 2026-06-08
🤖 AI Summary
This work addresses the generation of high-dimensional, low-sample-size (HDLSS) tabular data, where ill-posed density estimation, strong local correlations, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness hinder realistic synthesis. The authors propose a block-subunit generative framework that partitions the observed features into a small number of latent blocks and generates each block through a shared low-dimensional subunit variable. Global dependence is learned in the compact block-latent space, where deep priors such as diffusion models or normalizing flows can be fitted stably, and a copula-based decoder maps the latents back to the full feature space. Combined with flexible per-feature marginals and an explicit missingness mechanism, the method is reported to produce more realistic and more stable synthetic data than unstructured tabular generators in the HDLSS regime.
📝 Abstract
High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ is the number of samples and $m$ is the number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable. This concentrates global dependence learning in the compact block-latent space $\mathbb{R}^M$, while decoding to the full feature space uses copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data than unstructured tabular generators on HDLSS data.
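The block-subunit idea in the abstract can be illustrated with a small sampler. The sketch below is not the BSTabDiff implementation: the function name `sample_block_subunit`, its parameters, the equicorrelated Gaussian used in place of a learned diffusion or flow prior, the one-factor Gaussian copula, the Lomax marginal, and the block-wise missingness rule are all assumptions chosen to keep the example short and dependent only on the Python standard library.

```python
import math
import random
from statistics import NormalDist


def sample_block_subunit(n, m, M, rho=0.8, block_corr=0.3, miss_rate=0.2, seed=0):
    """Toy block-subunit sampler (hypothetical; every modeling choice is an assumption)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Contiguous, near-equal partition of the m features into M blocks.
    block_of = [j * M // m for j in range(m)]
    rows = []
    for _ in range(n):
        # Block latents z in R^M. An equicorrelated Gaussian stands in for the
        # learned deep prior (diffusion or normalizing flow).
        shared = rng.gauss(0.0, 1.0)
        z = [math.sqrt(block_corr) * shared
             + math.sqrt(1.0 - block_corr) * rng.gauss(0.0, 1.0)
             for _ in range(M)]
        # Assumed structured missingness: a block is dropped as a whole.
        dropped = [rng.random() < miss_rate for _ in range(M)]
        row = []
        for j in range(m):
            b = block_of[j]
            # One-factor Gaussian copula: features in block b share z[b],
            # which induces strong within-block correlation.
            g = math.sqrt(rho) * z[b] + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            u = std_normal.cdf(g)  # g has unit variance, so u is uniform on (0, 1)
            u = min(max(u, 1e-12), 1.0 - 1e-12)  # guard the inverse CDF at the edges
            # Heavy-tailed marginal: inverse CDF of a Lomax law with shape 3 (assumed).
            x = (1.0 - u) ** (-1.0 / 3.0) - 1.0
            row.append(None if dropped[b] else x)
        rows.append(row)
    return rows, block_of


# HDLSS-style call: far fewer samples than features, and far fewer blocks than features.
rows, block_of = sample_block_subunit(n=20, m=200, M=5)
```

In this sketch only the $M$-dimensional latent `z` carries cross-block dependence, so replacing the Gaussian stand-in with a learned prior would leave the decoding loop unchanged. That separation is the design point the abstract describes, although the paper's actual decoder and missingness model may differ.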
Problem

Research questions and friction points this paper is trying to address.

High-Dimensional Low-Sample Size
Tabular Data Generation
Ill-conditioned Density Learning
Structured Missingness
Non-Gaussian Marginals
Innovation

Methods, ideas, or system contributions that make the work stand out.

block-subunit diffusion
high-dimensional tabular generation
copula-based dependence
structured missingness
low-sample-size generative modeling
Al Zadid Sultan Bin Habib
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA
Md Younus Ahamed
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA
Prashnna Gyawali
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA
Gianfranco Doretto
West Virginia University
Computer Vision, Machine Learning, Biomedical Data Science, Artificial Intelligence
Donald A. Adjeroh
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA