🤖 AI Summary
This work proposes a GAN-inspired privacy-preserving synthetic data generation method that avoids direct access to the original data during training. Instead, it leverages fuzz testing to produce candidate samples and iteratively refines them through a discriminator-guided feedback loop, combined with statistical distribution constraints, to approximate the original data distribution. By integrating fuzz testing, adversarial discrimination, and indirect constraint mechanisms, the approach achieves strong privacy properties, effectively resisting membership inference and data reconstruction attacks, while preserving high data utility. Experiments on four benchmark datasets demonstrate that the proposed method strikes a better balance between privacy protection and data fidelity than existing techniques.
📝 Abstract
There is a need for synthetic training and test datasets that replicate the statistical distributions of original datasets without compromising their confidentiality. Much research has explored leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or remain vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, since the original data has been used in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
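The generate-discriminate-evolve loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the fuzzer, the discriminator, and the target statistics (`TARGET_MEAN`, `TARGET_STD`) are all hypothetical stand-ins, with a simple hill-climbing loop in place of the paper's refinement procedure, and a moment-matching score in place of a learned discriminator model.

```python
import random

# Hypothetical target statistics; in the actual approach these constraints
# would come from the input specification, not from the raw original records.
TARGET_MEAN, TARGET_STD = 50.0, 10.0


def fuzz_candidate(spec_low=0.0, spec_high=100.0, n=200):
    """Generation step: a fuzzer draws a sample within the specification range."""
    return [random.uniform(spec_low, spec_high) for _ in range(n)]


def discriminator_score(sample):
    """Discrimination step (stand-in): score how closely the sample's summary
    statistics match the target distribution; higher is better."""
    n = len(sample)
    mean = sum(sample) / n
    std = (sum((x - mean) ** 2 for x in sample) / n) ** 0.5
    return -(abs(mean - TARGET_MEAN) + abs(std - TARGET_STD))


def mutate(sample, strength=5.0):
    """Evolution step: perturb a random 10% of the candidate's values."""
    out = list(sample)
    for i in random.sample(range(len(out)), k=max(1, len(out) // 10)):
        out[i] += random.uniform(-strength, strength)
    return out


def evolve(generations=300, seed=0):
    """Keep only 'good samples': accept a mutation when the score improves."""
    random.seed(seed)
    best = fuzz_candidate()
    best_score = discriminator_score(best)
    for _ in range(generations):
        cand = mutate(best)
        score = discriminator_score(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Because the loop only ever accepts score-improving mutations, the discriminator score is monotonically non-decreasing, mirroring how discriminator feedback steers the fuzzer's output toward the target distribution without the generator ever reading the original records.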