🤖 AI Summary
In privacy-sensitive and data-scarce settings, synthesizing tabular data that is both fair and statistically useful remains challenging. Method: This paper proposes FairTabGen, a fairness-aware synthetic data generation framework built on large language models (LLMs) that incorporates counterfactual and causal fairness into both its generation and evaluation pipelines. It combines in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility in low-data regimes. Contributions/Results: Experiments across multiple datasets show that the method improves fairness metrics such as demographic parity and path-specific causal effects by up to 10% over state-of-the-art GAN- and LLM-based baselines, while retaining statistical utility and using less than 20% of the original data. These results make fair synthetic data generation more practical in small-sample scenarios.
📝 Abstract
Generating synthetic data is crucial in privacy-sensitive, data-scarce settings, especially for tabular datasets widely used in real-world applications. A key challenge is improving counterfactual and causal fairness while preserving high utility. We present FairTabGen, a fairness-aware, large language model (LLM)-based framework for tabular synthetic data generation. FairTabGen integrates multiple fairness definitions, including counterfactual and causal fairness, into both its generation and evaluation pipelines. It uses in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility. Across diverse datasets, our method outperforms state-of-the-art GAN-based and LLM-based methods, improving fairness metrics such as demographic parity and path-specific causal effects by up to 10% while retaining statistical utility. Notably, it achieves these gains using less than 20% of the original data, highlighting its efficiency in low-data regimes. These results demonstrate a principled and practical approach to generating fair and useful synthetic tabular data.
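
The abstract names demographic parity as one of the reported fairness metrics but does not spell out how it is computed. As a rough illustration only, the sketch below uses a common textbook definition: the gap between the highest and lowest positive-prediction rates across groups defined by a sensitive attribute. The function name `demographic_parity_difference` and the toy data are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, and the per-group rates themselves.

    Assumption: predictions are binary (1 = positive outcome) and each
    prediction is paired with one sensitive-attribute value in `groups`.
    """
    totals = defaultdict(int)      # number of records per group
    positives = defaultdict(int)   # number of positive predictions per group
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)

    rates = {g: positives[g] / totals[g] for g in totals}
    gap = max(rates.values()) - min(rates.values())
    return gap, rates


# Toy, made-up data: eight predictions split evenly across two groups.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates)  # per-group positive-prediction rates
print(gap)    # 0 would mean identical rates across groups
```

Under this definition, a smaller gap indicates that a model trained on the synthetic data assigns positive outcomes at more similar rates across groups. It says nothing on its own about counterfactual or path-specific causal fairness, which require a causal model of the data.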