🤖 AI Summary
This paper targets two limitations of the GReaT framework in generating realistic multi-table tabular data: sparse semantic representation and weak inter-table relational modeling. It proposes GReaTER, an end-to-end large language model (LLM)-based generation framework for multimodal (numerical, categorical, textual) multi-table data. GReaTER introduces (1) a data semantic enhancement system that maps low-semantic-density fields to context-rich natural language descriptions, strengthening the LLM's row-level modeling; and (2) a cross-table relational alignment mechanism that explicitly encodes foreign-key constraints and logical dependencies so that multiple tables can be generated in coordination. The framework serializes data row by row and combines semantic mapping, relational alignment, and lightweight dimensionality reduction. On multi-table benchmarks, GReaTER is reported to outperform GReaT in statistical fidelity, downstream machine learning utility, and privacy compliance.
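The semantic mapping and row-wise serialization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the lookup tables `SEMANTIC_MAP` and `COLUMN_NAMES`, the helper names `enhance` and `serialize`, and the "column is value" sentence template are all assumptions chosen for the example.

```python
# Hypothetical lookup tables mapping terse codes and column names to
# descriptive text. These values are illustrative, not from the paper.
SEMANTIC_MAP = {
    "sex": {"M": "male", "F": "female"},
    "edu": {1: "high school diploma", 2: "bachelor's degree", 3: "graduate degree"},
}
COLUMN_NAMES = {
    "sex": "gender",
    "edu": "highest education level",
    "inc": "annual income in USD",
}


def enhance(row):
    """Replace low-semantic codes and column names with descriptive text."""
    enhanced = {}
    for col, val in row.items():
        name = COLUMN_NAMES.get(col, col)
        # Values without a mapping (e.g. numbers) are passed through unchanged.
        enhanced[name] = SEMANTIC_MAP.get(col, {}).get(val, val)
    return enhanced


def serialize(row):
    """Turn one row into a single sentence of 'column is value' clauses."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())


row = {"sex": "F", "edu": 2, "inc": 52000}
sentence = serialize(enhance(row))
print(sentence)
```

The raw row `{"sex": "F", "edu": 2, "inc": 52000}` gives a language model little to work with, whereas the enhanced sentence uses words the model has seen during pre-training. An inverse mapping would be needed to turn generated sentences back into table rows.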
📝 Abstract
Tabular data synthesis covers both multi-table synthesis and the generation of multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, handling numerical and categorical data separately has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Even so, the framework's performance is constrained by two issues: (1) tabular data entries carry little semantic meaning, which limits the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) in complex multi-table datasets, it is difficult to establish effective relationships between tables. To address these issues, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method that establishes efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
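One simple way to picture a cross-table connection is to join each child row to its parent through a shared key before serializing, so that a single sentence carries information from both tables. The sketch below assumes this join-then-serialize approach purely for illustration; the abstract does not specify the actual connecting method, and the function `connect`, the table names, and the sample data are hypothetical.

```python
def connect(parents, children, key, parent_name, child_name):
    """Attach each child row to its parent row via a shared key column."""
    parents_by_key = {p[key]: p for p in parents}
    joined = []
    for child in children:
        parent = parents_by_key.get(child[key])
        if parent is None:
            continue  # skip child rows whose key has no matching parent
        merged = {key: child[key]}
        # Prefix columns with their table name so the origin stays visible.
        merged.update({f"{parent_name} {c}": v for c, v in parent.items() if c != key})
        merged.update({f"{child_name} {c}": v for c, v in child.items() if c != key})
        joined.append(merged)
    return joined


# Illustrative two-table dataset: customers (parent) and orders (child).
customers = [{"customer_id": 1, "city": "Lyon"}, {"customer_id": 2, "city": "Osaka"}]
orders = [
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 12},
    {"customer_id": 9, "amount": 5},  # no matching customer
]

rows = connect(customers, orders, "customer_id", "customer", "order")
for r in rows:
    print(", ".join(f"{col} is {val}" for col, val in r.items()))
```

Joining before serialization keeps the parent-child link explicit in every training sentence, at the cost of repeating parent fields for each child row. A reduction step, such as the one the framework's name refers to, could plausibly help control that redundancy, though the abstract does not say how.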