GEM-T: Generative Tabular Data via Fitting Moments

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address tabular data generation when the data is limited or sensitive, this paper proposes GEM-T ("generative entropy maximization for tables"), a lightweight generative framework grounded in the principle of maximum entropy (MaxEnt). GEM-T directly captures nth-order interactions (pairwise, third-order, and so on) among the columns of the training data, after the input data has been appropriately transformed. In testing on 34 publicly available datasets, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art on 23 of them (68%), while using orders-of-magnitude fewer trainable parameters. The authors take this as evidence that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, and they report that MaxEnt better handles heterogeneous data types (continuous, discrete, categorical) and the lack of local structure in tables. They present GEM-T as a promising direction for lightweight generative models for structured data.

📝 Abstract
Tabular data dominates data science but poses challenges for generative models, especially when the data is limited or sensitive. We present a novel approach to generating synthetic tabular data based on the principle of maximum entropy (MaxEnt), called GEM-T, for "generative entropy maximization for tables." GEM-T directly captures nth-order interactions (pairwise, third-order, etc.) among columns of training data. In extensive testing, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art in 23 of 34 publicly available datasets representing diverse subject domains (68%). Notably, GEM-T involves orders-of-magnitude fewer trainable parameters, demonstrating that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, provided that the input data is appropriately transformed first. Furthermore, MaxEnt better handles heterogeneous data types (continuous vs. discrete vs. categorical), lack of local structure, and other features of tabular data. GEM-T represents a promising direction for lightweight high-performance generative models for structured data.
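
As background for the abstract's appeal to maximum entropy, the textbook MaxEnt problem with moment constraints can be written as below. This is the standard result, not a formula taken from the paper: the feature functions f_k, the targets \hat{\mu}_k, and the multipliers \lambda_k are notation assumed here for illustration, and whether GEM-T uses exactly this parameterization is not stated in this summary.

```latex
\begin{aligned}
\max_{p}\quad & H[p] = -\int p(x)\,\log p(x)\,dx \\
\text{s.t.}\quad & \mathbb{E}_{p}\!\left[f_k(x)\right] = \hat{\mu}_k,\quad k = 1,\dots,K,
  \qquad \int p(x)\,dx = 1 \\
\Longrightarrow\quad & p(x) = \frac{1}{Z(\lambda)}
  \exp\!\Big(\sum_{k=1}^{K} \lambda_k f_k(x)\Big),
  \qquad Z(\lambda) = \int \exp\!\Big(\sum_{k=1}^{K} \lambda_k f_k(x)\Big)\,dx
\end{aligned}
```

If each f_k is a product of (transformed) columns, such as x_i x_j for a pairwise interaction or x_i x_j x_l for a third-order one, and \hat{\mu}_k is the corresponding sample average in the training data, then the model has one multiplier \lambda_k per constrained moment. That is one way to see how a moment-based model can have far fewer trainable parameters than a deep network.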
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic tabular data with limited or sensitive information
Capturing high-order interactions among columns in training data
Handling heterogeneous data types and lack of local structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses maximum entropy principle for data generation
Directly captures nth-order column interactions
Handles heterogeneous data types effectively
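
To make "nth-order column interactions" concrete, the sketch below computes sample mixed moments of a chosen order for a numeric table and compares a real table against a synthetic one. It is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `mixed_moments` and `moment_gap` are hypothetical, the table is assumed to be already encoded as numbers, and plain standardization stands in for whatever input transformation GEM-T actually applies.

```python
from itertools import combinations_with_replacement

import numpy as np


def mixed_moments(X, order, ref=None):
    """Sample mixed moments of a given order over all column combinations.

    Columns are centered and scaled with the mean and standard deviation of
    `ref` (defaults to X itself), so two tables can be compared on one scale.
    Returns {tuple of column indices: mean of the product of those columns}.
    Hypothetical helper, not part of GEM-T.
    """
    X = np.asarray(X, dtype=float)
    ref = X if ref is None else np.asarray(ref, dtype=float)
    mu = ref.mean(axis=0)
    sigma = ref.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant columns unscaled
    Z = (X - mu) / sigma
    return {
        idx: float(np.prod(Z[:, list(idx)], axis=1).mean())
        for idx in combinations_with_replacement(range(Z.shape[1]), order)
    }


def moment_gap(real, synthetic, order):
    """Largest absolute difference between real and synthetic mixed moments.

    Both tables are scaled with the statistics of the real table.
    """
    m_real = mixed_moments(real, order)
    m_syn = mixed_moments(synthetic, order, ref=real)
    return max(abs(m_real[k] - m_syn[k]) for k in m_real)
```

With `order=2` the keys are column pairs and the values are correlation-like quantities; with `order=3` they are third-order interactions. A small `moment_gap(real, synthetic, 3)` would indicate that the synthetic table reproduces the third-order structure of the real one, at least as measured on this standardized scale.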
Authors

Miao Li
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Phuc Nguyen
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Christopher Tam
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Alexandra Morgan
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Kenneth Ge
Undergrad
computer science, human computer interaction, machine learning, artificial intelligence
Rahul Bansal
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Linzi Yu
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215
Rima Arnaout
Department of Medicine, the Bakar Computational Health Sciences Institute, and the UCSF UC Berkeley Joint Program for Computational Precision Health at the University of California San Francisco, San Francisco, CA 94143
Ramy Arnaout
Department of Pathology and the Division of Clinical Informatics, Department of Medicine, BIDMC and with Harvard Medical School, Boston, MA 02215