🤖 AI Summary
Genomic data sharing is severely hindered by privacy concerns, undermining the reproducibility and validation of genome-wide association study (GWAS) results. To address this, we propose the first two-stage framework that jointly optimizes privacy protection and statistical utility. In Stage I, we design a biology-informed XOR-based differential privacy (DP) mechanism tailored to genomic data structures, substantially enhancing resilience against membership inference attacks. In Stage II, we introduce a minor allele frequency (MAF)-calibrated decoding scheme grounded in optimal transport theory, coupled with binary-space mapping, to faithfully reconstruct allele frequency distributions and preserve GWAS statistical power. Experiments on three real-world genomic datasets demonstrate that our method significantly outperforms state-of-the-art local DP and synthetic data approaches across false discovery rate control, privacy guarantee strength (measured by ε), and data utility—achieving, for the first time, synergistic optimization of high-precision variant detection and strong formal privacy.
📝 Abstract
As genomic research has become increasingly popular in recent years, the sharing of datasets has remained limited due to privacy concerns. This limitation hinders the reproduction and validation of research outcomes, which are essential for identifying computation errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with published MAFs using optimal transport. Finally, we decode the processed data back into its genomic form for further use. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. We show that our proposed scheme outperforms all existing methods in detecting GWAS outcome errors, achieves better utility, provides higher privacy protection against membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality.