GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction

📅 2025-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The GReaT framework has two key limitations when generating realistic multi-table tabular data: sparse semantic representation and weak inter-table relational modeling. To address them, this paper proposes GReaTER, the first end-to-end large language model (LLM)-based generation framework tailored for multimodal (numerical, categorical, textual) multi-table data. Methodologically, GReaTER introduces (1) a data semantic enhancement system that maps low-semantic-density fields into context-rich natural language descriptions, thereby strengthening LLMs' row-level modeling capability; and (2) a cross-table relational alignment mechanism that explicitly encodes foreign-key constraints and logical dependencies to enable coordinated multi-table generation. The framework adopts row-wise serialization, integrating semantic mapping, relational alignment, and lightweight dimensionality reduction. Evaluated on multi-table benchmarks, GReaTER consistently outperforms GReaT across all metrics, achieving state-of-the-art performance in statistical fidelity, downstream machine learning utility, and privacy compliance.

📝 Abstract

Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) in complex multi-table datasets, it is difficult to establish effective relationships between tables so that they can be generated together. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
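The two ideas in the abstract, encoding a whole row as text and mapping low-semantic entries to richer descriptions, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the "column is value" sentence format follows the style commonly associated with GReaT, and the names `SEMANTIC_MAP`, `COLUMN_NAMES`, `enhance`, and `serialize`, as well as the example columns, are hypothetical.

```python
import random

# Hypothetical lookup tables (illustrative only): terse codes and column
# names are mapped to phrases an LLM is more likely to have seen in text.
SEMANTIC_MAP = {
    "sex": {"M": "male", "F": "female"},
    "edu": {"1": "high school diploma", "2": "bachelor's degree"},
}
COLUMN_NAMES = {
    "sex": "sex",
    "edu": "highest education level",
    "inc": "annual income in USD",
}

def enhance(row):
    """Replace low-semantic codes and column names with descriptive text."""
    enhanced = {}
    for col, val in row.items():
        name = COLUMN_NAMES.get(col, col)
        # Fall back to the raw value when no mapping is defined for it.
        enhanced[name] = SEMANTIC_MAP.get(col, {}).get(str(val), str(val))
    return enhanced

def serialize(row, rng=None):
    """Encode an entire row as one sentence of 'column is value' clauses.

    Numerical and categorical fields are treated alike, so no partitioning
    by data type is needed. Passing an rng shuffles the clause order.
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)

row = {"sex": "F", "edu": 2, "inc": 52000}
text = serialize(enhance(row))
print(text)
# sex is female, highest education level is bachelor's degree, annual income in USD is 52000
```

Passing `rng=random.Random(0)` to `serialize` permutes the clauses, which is one way to avoid tying the model to a fixed column order; whether GReaTER does this is not stated in the abstract.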

Problem

Research questions and friction points this paper is trying to address.

Enhance semantic understanding of tabular data for LLMs

Improve multi-table relationship establishment in datasets

Generate realistic tabular data without partitioning data types

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to encode entire tabular rows

Enhances data semantics for better LLM learning

Connects complex tables for efficient relationships
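The page does not describe how the tables are connected, so the following is only a sketch of one plausible reading: join each child row with its parent row on a shared key, so that a single serialized row carries the relationship. The tables, the `connect` function, and the `parent_` prefix are all assumptions made for illustration.

```python
# Hypothetical parent and child tables linked by "customer_id".
customers = [
    {"customer_id": 1, "city": "Lyon"},
    {"customer_id": 2, "city": "Osaka"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.5},
    {"order_id": 12, "customer_id": 1, "amount": 12.0},
]

def connect(parent_rows, child_rows, key):
    """Left-join each child row with its parent row on a shared key."""
    parents = {parent[key]: parent for parent in parent_rows}
    joined = []
    for child in child_rows:
        merged = dict(child)
        # A child without a matching parent keeps only its own columns.
        for col, val in parents.get(child[key], {}).items():
            if col != key:
                merged[f"parent_{col}"] = val
        joined.append(merged)
    return joined

flat = connect(customers, orders, "customer_id")
for joined_row in flat:
    print(joined_row)
```

Each joined row can then be serialized as text in the same way as a single-table row, at the cost of repeating parent attributes across children.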

🔎 Similar Papers

On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation

2024-09-06 · arXiv.org · Citations: 0

TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

2024-10-02 · arXiv.org · Citations: 1

MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data

2024-06-15 · arXiv.org · Citations: 5