A Comprehensive Survey of Synthetic Tabular Data Generation

📅 2025-04-23

🤖 AI Summary

Existing surveys predominantly focus on isolated paradigms—such as GANs or VAEs—lacking systematic integration of traditional models, diffusion models, and large language models (LLMs), and omitting the full generative pipeline (generation → post-processing → evaluation). To address this gap, we propose the first unified taxonomy for synthetic tabular data generation, categorizing methods into three foundational paradigms: traditional models, diffusion models, and LLMs. We formally define an end-to-end technical workflow and conduct a cross-paradigm analysis of shared challenges and synergies. Our survey comprehensively integrates energy-based models, VAEs, GANs, diffusion models, and LLMs, while incorporating critical dimensions including privacy preservation, evaluation metrics, and post-processing techniques. This constitutes the field’s first holistic survey, clarifying evolutionary trajectories, identifying practical deployment scenarios, and outlining key research directions—thereby providing a systematic reference framework for both academia and industry.

📝 Abstract

Tabular data remains one of the most prevalent and critical data formats across diverse real-world applications. However, its effective use in machine learning (ML) is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets and produce high-fidelity, privacy-preserving samples. Various generative paradigms have been explored, including energy-based models (EBMs), variational autoencoders (VAEs), generative adversarial networks (GANs), large language models (LLMs), and diffusion models. While several surveys have investigated synthetic tabular data generation, most focus on narrow subdomains or specific generative methods, such as GANs, diffusion models, or privacy-preserving techniques. This limited scope often results in fragmented insights, lacking a comprehensive synthesis that bridges diverse approaches. In particular, recent advances driven by LLMs and diffusion-based models remain underexplored. This gap hinders a holistic understanding of the field's evolution, methodological interplay, and open challenges. To address this, our survey provides a unified and systematic review of synthetic tabular data generation. Our contributions are threefold: (1) we propose a comprehensive taxonomy that organizes existing methods into traditional approaches, diffusion-based methods, and LLM-based models, and provide an in-depth comparative analysis; (2) we detail the complete pipeline for synthetic tabular data generation, including data synthesis, post-processing, and evaluation; (3) we identify major challenges, explore real-world applications, and outline open research questions and directions to guide future work in this rapidly evolving area.

Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity, privacy concerns, and class imbalance in ML through synthetic tabular data generation

Providing a unified review of diverse generative methods

Identifying open research questions in the synthetic data field

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative models for synthetic tabular data

Unified taxonomy: traditional, diffusion, LLM-based

Pipeline: synthesis, post-processing, evaluation
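
The three pipeline stages can be illustrated with a minimal Python sketch. It is not taken from the paper: the independent-Gaussian sampler is a toy stand-in for a real generative model (GAN, VAE, diffusion model, or LLM), and the function names `fit_gaussian`, `synthesize`, `post_process`, and `evaluate`, along with the two-column example data, are hypothetical.

```python
import random
import statistics


def fit_gaussian(rows):
    """Fit one independent Gaussian per column (toy stand-in for a generative model)."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def synthesize(params, n, seed=0):
    """Stage 1, synthesis: draw n rows from the fitted per-column Gaussians."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


def post_process(rows, bounds):
    """Stage 2, post-processing: clip each value to the range seen in the real data."""
    return [[min(max(v, lo), hi) for v, (lo, hi) in zip(row, bounds)] for row in rows]


def evaluate(real, synth):
    """Stage 3, evaluation: per-column absolute gap between real and synthetic means."""
    return [
        abs(statistics.mean(r) - statistics.mean(s))
        for r, s in zip(zip(*real), zip(*synth))
    ]


# Hypothetical real table with two numeric columns (age, income).
real = [
    [25.0, 40000.0],
    [32.0, 52000.0],
    [47.0, 81000.0],
    [51.0, 90000.0],
    [38.0, 61000.0],
]

params = fit_gaussian(real)
bounds = [(min(c), max(c)) for c in zip(*real)]
synthetic = post_process(synthesize(params, 200), bounds)
gaps = evaluate(real, synthetic)
```

Here `gaps` holds one crude fidelity score per column, where smaller means the synthetic column mean is closer to the real one. Practical evaluation would also cover downstream ML utility and privacy, which this sketch omits.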

🔎 Similar Papers

TAEGAN: Generating Synthetic Tabular Data For Data Augmentation