🤖 AI Summary
Tabular data generation must capture high-order dependencies among columns and produce high-fidelity samples, which is especially difficult when data is scarce or sensitive. This paper proposes GEM-T, a lightweight generative framework grounded in the principle of maximum entropy. GEM-T directly captures nth-order interactions (pairwise, third-order, and so on) among columns and, once the input data is appropriately transformed, handles heterogeneous variable types without deep neural networks. Evaluated on 34 public datasets, GEM-T matches or exceeds deep generative models previously regarded as state-of-the-art on 23 of them (68%), while using orders of magnitude fewer trainable parameters. The results indicate that much of the information in real-world tabular data resides in low-dimensional, potentially human-interpretable correlations, and they position maximum entropy as a promising basis for lightweight generative models of structured data.
📝 Abstract
Tabular data dominates data science but poses challenges for generative models, especially when the data is limited or sensitive. We present GEM-T ("generative entropy maximization for tables"), a novel approach to generating synthetic tabular data based on the principle of maximum entropy (MaxEnt). GEM-T directly captures nth-order interactions (pairwise, third-order, etc.) among columns of training data. In extensive testing, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art on 23 of 34 publicly available datasets (68%) representing diverse subject domains. Notably, GEM-T involves orders-of-magnitude fewer trainable parameters, demonstrating that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, provided that the input data is appropriately transformed first. Furthermore, MaxEnt better handles heterogeneous data types (continuous vs. discrete vs. categorical), lack of local structure, and other features of tabular data. GEM-T represents a promising direction for lightweight, high-performance generative models for structured data.
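To make the idea of capturing pairwise interactions under maximum entropy concrete, the following is a minimal sketch and not GEM-T's actual algorithm, which the abstract does not specify. It assumes a toy table of three binary columns, enumerates all states exactly, and fits the maximum-entropy distribution whose first- and second-order moments match the data. The names `features`, `softmax`, and `fit_maxent` are hypothetical helpers introduced only for this illustration.

```python
import itertools
import numpy as np


def features(x):
    """Stack first-order terms x_i and pairwise products x_i * x_j (i < j)."""
    n = x.shape[1]
    pairs = [x[:, i] * x[:, j] for i, j in itertools.combinations(range(n), 2)]
    return np.column_stack([x] + pairs)


def softmax(logits):
    """Numerically stable normalization of unnormalized log-probabilities."""
    w = np.exp(logits - logits.max())
    return w / w.sum()


def fit_maxent(data, steps=20000, lr=0.5):
    """Fit p(x) proportional to exp(theta . f(x)) by matching moments.

    Assumption: columns are binary and few enough that all 2^n states
    can be enumerated exactly. Real tables would need a different scheme.
    """
    n = data.shape[1]
    states = np.array(list(itertools.product([0.0, 1.0], repeat=n)))
    f = features(states)
    target = features(data.astype(float)).mean(axis=0)  # empirical moments
    theta = np.zeros(f.shape[1])
    for _ in range(steps):
        p = softmax(f @ theta)
        # Gradient of the log-likelihood: data moments minus model moments.
        theta += lr * (target - p @ f)
    return states, softmax(f @ theta), theta


# Toy "table": column b is correlated with column a, column c is independent.
rng = np.random.default_rng(0)
a = rng.random(500) < 0.5
b = a ^ (rng.random(500) < 0.2)
c = rng.random(500) < 0.3
data = np.column_stack([a, b, c]).astype(float)

states, p, theta = fit_maxent(data)
# Draw synthetic rows from the fitted distribution.
synthetic = states[rng.choice(len(states), size=1000, p=p)]
```

The fitted model here has only six parameters (three first-order and three pairwise terms), which illustrates in miniature why a moment-matching model can be far smaller than a deep network. Extending the feature map with triple products would add third-order interactions in the same way.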