FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation

📅 2025-08-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In privacy-sensitive, data-scarce settings, synthesizing tabular data that is both fair and statistically useful remains challenging. Method: The paper proposes FairTabGen, a fairness-aware framework built on large language models (LLMs) that applies counterfactual and causal fairness definitions in both the generation and evaluation pipelines. It combines in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility in low-data regimes. Contributions/Results: Across multiple datasets, the method outperforms state-of-the-art GAN- and LLM-based baselines, improving fairness metrics such as demographic parity and path-specific causal effects by up to 10% while retaining statistical utility. It achieves these gains using less than 20% of the original data, which makes fair synthetic data generation more practical in small-sample scenarios.

📝 Abstract

Generating synthetic data is crucial in privacy-sensitive, data-scarce settings, especially for tabular datasets widely used in real-world applications. A key challenge is improving counterfactual and causal fairness while preserving high utility. We present FairTabGen, a fairness-aware, large language model (LLM)-based framework for tabular synthetic data generation. It integrates multiple fairness definitions, including counterfactual and causal fairness, into both its generation and evaluation pipelines. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility. Across diverse datasets, our method outperforms state-of-the-art GAN-based and LLM-based methods, achieving up to 10% improvements on fairness metrics such as demographic parity and path-specific causal effects while retaining statistical utility. Remarkably, it achieves these gains using less than 20% of the original data, highlighting its efficiency in low-data regimes. These results demonstrate a principled and practical approach for generating fair and useful synthetic tabular data.
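
To make one of the fairness metrics named in the abstract concrete, here is a minimal sketch of a demographic parity check on a table of records. It is not the paper's evaluation code: the function name `demographic_parity_difference`, the column names `sex` and `approved`, and the toy rows are all assumptions made up for illustration. Demographic parity is taken in its common form, the gap in positive-outcome rates between sensitive groups.

```python
from collections import defaultdict


def demographic_parity_difference(records, sensitive_key, outcome_key, positive=1):
    """Return (gap, rates): the largest difference in positive-outcome
    rate between any two sensitive groups, and the per-group rates."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in records:
        group = row[sensitive_key]
        counts[group][1] += 1
        if row[outcome_key] == positive:
            counts[group][0] += 1
    rates = {g: pos / total for g, (pos, total) in counts.items()}
    return max(rates.values()) - min(rates.values()), rates


# Toy rows invented for illustration only (not data from the paper).
rows = [
    {"sex": "F", "approved": 1},
    {"sex": "F", "approved": 0},
    {"sex": "F", "approved": 0},
    {"sex": "F", "approved": 1},
    {"sex": "M", "approved": 1},
    {"sex": "M", "approved": 1},
    {"sex": "M", "approved": 1},
    {"sex": "M", "approved": 0},
]

gap, rates = demographic_parity_difference(rows, "sex", "approved")
print(rates)  # per-group positive rates
print(gap)    # a gap of 0 would mean equal positive rates across groups
```

A smaller gap indicates that the (synthetic) table treats the groups more evenly on this metric. Path-specific causal effects require a causal graph and are not covered by this sketch.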

Problem

Research questions and friction points this paper is trying to address.

Improving counterfactual and causal fairness in synthetic tabular data

Balancing fairness and utility in privacy-sensitive data generation

Addressing data scarcity while maintaining statistical utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based framework for tabular data generation

Integrates counterfactual and causal fairness definitions

Uses in-context learning and prompt refinement techniques
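
The in-context learning idea listed above can be sketched as building a few-shot prompt from a handful of serialized rows. This is only an assumed illustration: the summary does not give the paper's prompt format, so the serialization scheme, the wording of the instructions, and the helper names `serialize_row` and `build_fewshot_prompt` are hypothetical. No LLM call is made here, only the prompt string is constructed.

```python
def serialize_row(row, columns):
    """Turn one record into a 'col: value, col: value' text line."""
    return ", ".join(f"{col}: {row[col]}" for col in columns)


def build_fewshot_prompt(examples, columns, n_new):
    """Assemble a few-shot prompt from example rows (hypothetical format)."""
    header = (
        f"Each line below is one record with the fields {', '.join(columns)}.\n"
        f"Generate {n_new} new records in the same format."
    )
    body = "\n".join(serialize_row(r, columns) for r in examples)
    return f"{header}\n\nExamples:\n{body}\n\nNew records:"


# Toy example rows invented for illustration only.
columns = ["age", "income", "approved"]
examples = [
    {"age": 34, "income": 52000, "approved": 1},
    {"age": 51, "income": 38000, "approved": 0},
]

prompt = build_fewshot_prompt(examples, columns, n_new=5)
print(prompt)
```

In a setup like this, fairness-aware curation would amount to choosing which rows go into `examples`, and prompt refinement to revising the instruction text. How FairTabGen actually does either is not specified in this summary.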

🔎 Similar Papers

Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models