🤖 AI Summary
This survey examines the inherent tension between formal privacy guarantees—particularly differential privacy (DP)—and downstream utility in high-stakes domains such as healthcare and finance. It lays out the theoretical foundations of generative models and DP, reviews state-of-the-art privacy-preserving synthesis methods across tabular data, images, and text, and synthesizes evaluation approaches that jointly quantify privacy protection and task-specific utility. An accompanying empirical analysis of four leading methods on five real-world datasets from specialized domains uncovers a substantial gap between results on standard benchmarks and performance in realistic deployment scenarios: utility degrades sharply across mainstream methods once ε ≤ 4. The paper identifies critical research gaps—the lack of realistic domain-specific benchmarks and insufficient empirical verification of formal guarantees—and argues for reproducible, empirically grounded evaluation of privacy-enhancing synthetic data systems.
📝 Abstract
Privacy-preserving synthetic data offers a promising solution for harnessing segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and the insufficient empirical evaluation needed to contextualize formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\epsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. Our findings highlight key challenges, including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques addressing the unique requirements of privacy-sensitive fields, so that this technology can deliver on its considerable potential.
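To make the privacy budget $\epsilon$ concrete, here is a minimal sketch of the classic Laplace mechanism (a textbook DP primitive, not any of the surveyed generative methods): a statistic with sensitivity $\Delta$ is released with Laplace noise of scale $\Delta/\epsilon$, so expected error grows inversely with $\epsilon$. The dataset size and statistic below are hypothetical, chosen only for illustration.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with epsilon-DP by adding Laplace(sensitivity/epsilon) noise."""
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise scale
    return value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
true_mean = 0.75          # hypothetical dataset statistic
sensitivity = 1.0 / 1000  # sensitivity of a mean over n=1000 records in [0, 1]

# Expected absolute error of the release is exactly the noise scale, Delta/epsilon.
for eps in (8.0, 4.0, 1.0, 0.1):
    releases = np.array([laplace_mechanism(true_mean, sensitivity, eps, rng)
                         for _ in range(5000)])
    print(f"eps={eps:>4}: mean |error| ~= {np.mean(np.abs(releases - true_mean)):.6f}")
```

This is why utility collapses as budgets tighten: moving from $\epsilon = 8$ to $\epsilon = 0.1$ multiplies the noise scale by 80, and generative models trained under DP-SGD pay an analogous cost in sample fidelity.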