GEM-T: Generative Tabular Data via Fitting Moments

📅 2025-09-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the challenge of modeling high-order inter-column dependencies in tabular data generation, particularly when data is scarce or sensitive, this paper proposes GEM-T, a lightweight generative framework. Grounded in the principle of maximum entropy (MaxEnt), GEM-T directly captures nth-order interactions (pairwise, third-order, and so on) among the columns of the training data, and it handles heterogeneous variable types (continuous, discrete, categorical) without deep neural networks, provided the input data is appropriately transformed first. Evaluated on 34 publicly available datasets, GEM-T matches or exceeds deep generative models previously regarded as state-of-the-art on 23 of them (68%), while using orders-of-magnitude fewer trainable parameters. The results suggest that much of the information in real-world tabular data resides in low-dimensional, potentially human-interpretable correlations, and they position GEM-T as a promising direction for lightweight synthesis of limited or sensitive tabular data.

📝 Abstract

Tabular data dominates data science but poses challenges for generative models, especially when the data is limited or sensitive. We present a novel approach to generating synthetic tabular data based on the principle of maximum entropy (MaxEnt), called GEM-T, for "generative entropy maximization for tables." GEM-T directly captures nth-order interactions (pairwise, third-order, etc.) among columns of training data. In extensive testing, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art in 23 of 34 (68%) publicly available datasets representing diverse subject domains. Notably, GEM-T involves orders-of-magnitude fewer trainable parameters, demonstrating that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, provided that the input data is appropriately transformed first. Furthermore, MaxEnt better handles heterogeneous data types (continuous vs. discrete vs. categorical), lack of local structure, and other features of tabular data. GEM-T represents a promising direction for lightweight, high-performance generative models for structured data.
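
For readers unfamiliar with MaxEnt, the textbook result behind moment matching can be written as below. The notation is generic and assumed here for illustration (feature functions f_k, empirical moments \hat{\mu}_k, multipliers \lambda_k); it is not taken from the paper, whose exact formulation may differ.

```latex
\max_{p}\; -\sum_{x} p(x)\,\log p(x)
\quad \text{s.t.} \quad
\sum_{x} p(x) = 1, \qquad
\mathbb{E}_{p}\!\left[f_k(x)\right] = \hat{\mu}_k, \;\; k = 1,\dots,K
\;\;\Longrightarrow\;\;
p(x) = \frac{1}{Z(\lambda)} \exp\!\Big( \sum_{k=1}^{K} \lambda_k f_k(x) \Big)
```

Choosing the f_k as products of columns (x_i, x_i x_j, x_i x_j x_l, ...) corresponds to constraining first-, second-, and third-order moments, so the number of trainable parameters equals the number of constrained moments K.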

Problem

Research questions and friction points this paper is trying to address.

Generating synthetic tabular data when the training data is limited or sensitive

Capturing high-order interactions among columns in training data

Handling heterogeneous data types and lack of local structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the maximum entropy principle for data generation

Directly captures nth-order column interactions

Handles heterogeneous data types effectively
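
The ideas listed above can be illustrated with a minimal sketch of MaxEnt moment matching. It is not the GEM-T implementation: it assumes a toy table of three binary columns, constrains only first-order and pairwise moments, and enumerates all states exactly, which does not scale beyond a handful of columns. The names `features`, `fit_maxent`, and `sample` are hypothetical helpers introduced for this example.

```python
import itertools
import numpy as np

def features(x):
    """Moment features: each column x_i plus every pairwise product x_i * x_j (i < j)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    pairs = [x[..., i] * x[..., j] for i, j in itertools.combinations(range(d), 2)]
    return np.concatenate([x, np.stack(pairs, axis=-1)], axis=-1)

def fit_maxent(data, steps=20000, lr=0.5):
    """Fit p(x) proportional to exp(theta . f(x)) so model moments match empirical ones."""
    data = np.asarray(data)
    d = data.shape[1]
    # Enumerate every binary state (feasible only for small d; an assumption of this toy).
    states = np.array(list(itertools.product([0, 1], repeat=d)))
    F = features(states)                   # shape (2**d, K)
    target = features(data).mean(axis=0)   # empirical moments
    theta = np.zeros(F.shape[1])
    for _ in range(steps):
        logits = F @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Log-likelihood gradient: empirical moments minus model moments.
        theta += lr * (target - p @ F)
    logits = F @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return states, p, theta

def sample(states, p, n, rng):
    """Draw n synthetic rows from the fitted distribution."""
    idx = rng.choice(len(states), size=n, p=p)
    return states[idx]

# Toy training table: column 1 is a noisy copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.2
x1 = np.where(flip, 1 - x0, x0)
x2 = (rng.random(500) < 0.3).astype(int)
data = np.column_stack([x0, x1, x2])

states, p, theta = fit_maxent(data)
synthetic = sample(states, p, 1000, rng)
moment_gap = np.max(np.abs(p @ features(states) - features(data).mean(axis=0)))
```

Here the model has only six parameters (three means and three pairwise terms), and `moment_gap` measures how closely the fitted distribution reproduces the constrained moments. Extending `features` with triple products would add third-order interactions in the same way; continuous or categorical columns would first need some transformation or encoding, which this sketch does not attempt.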

🔎 Similar Papers

On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation