Private Synthetic Data Generation in Small Memory

📅 2024-12-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address differentially private synthetic data generation under stringent memory constraints, this paper proposes PrivHP, a lightweight algorithm. PrivHP integrates hierarchical cumulative distribution function (CDF) approximation with private frequency sketches, achieving $\varepsilon$-differential privacy using only $O(k \log^2 |X|)$ memory. Its key innovation lies in a skew-aware joint optimization mechanism for the pruning parameter $k$ and hierarchy depth, enabling the first derivation of an interpretable, explicitly decoupled Wasserstein error bound that separates noise, pruning, and estimation errors. Theoretically and empirically, this bound balances space efficiency, privacy guarantees, and data utility. Experiments demonstrate that PrivHP significantly outperforms state-of-the-art methods in synthetic data quality under low-memory regimes, while maintaining rigorous privacy compliance.

📝 Abstract

We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |X|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator with expected Wasserstein distance: \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{\|\mathtt{tail}_k(\mathcal{X})\|_1}{M n}\right) \] from the empirical distribution. This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
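
To make the skew measure concrete, the following is a minimal sketch of how $\|\mathtt{tail}_k(\mathcal{X})\|_1$ and the two terms inside the bound could be evaluated for a given list of subdomain counts. The function names `tail_k_l1` and `bound_terms` are hypothetical, the logarithm is assumed to be natural, and the hidden constants of the $\mathcal{O}(\cdot)$ are ignored, so the numbers only indicate how the terms scale.

```python
import math
from collections import Counter

def tail_k_l1(counts, k):
    """L1 norm of the count vector after dropping its k largest entries."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[k:])

def bound_terms(n, k, eps, memory, counts):
    """Noise term and pruning term of the bound, constants ignored (assumption)."""
    noise_term = math.log(memory) ** 2 / (eps * n)       # log^2(M) / (eps * n)
    pruning_term = tail_k_l1(counts, k) / (memory * n)   # ||tail_k||_1 / (M * n)
    return noise_term, pruning_term

# Hypothetical assignment of 10 points to 4 subdomains.
cells = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
counts = list(Counter(cells).values())   # counts 4, 3, 2, 1
print(tail_k_l1(counts, 2))              # prints 3: the counts outside the top 2
```

On heavily skewed data the tail shrinks quickly as $k$ grows, so the pruning term becomes small relative to the noise term. This is the space-utility trade-off that $k$ controls.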

Problem

Research questions and friction points this paper is trying to address.

Generates private synthetic data efficiently

Ensures differential privacy in data generation

Balances memory usage and data utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight differentially private generator

Hierarchical CDF approximation

Efficient private frequency sketching
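
The three ideas above can be illustrated with a small sketch of a noisy dyadic hierarchy over $[0,1]$ that keeps at most $k$ cells per level. This is not the paper's algorithm. It assumes an even privacy budget split of $\varepsilon/\text{depth}$ per level, add-or-remove-one neighbouring datasets, and exact per-level counts in place of private sketches, so it does not reproduce the memory bound. The names `laplace` and `noisy_pruned_hierarchy` are hypothetical.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_pruned_hierarchy(data, depth, k, eps, rng):
    """Noisy counts for the kept cells at each level; data assumed to lie in [0, 1]."""
    scale = depth / eps      # Laplace scale for a per-level budget of eps / depth
    kept = [0]               # level 0 is the single cell [0, 1]
    levels = {}
    for level in range(1, depth + 1):
        n_cells = 2 ** level
        exact = {}
        for x in data:
            # Cell of width 1 / n_cells containing x (x = 1.0 goes to the last cell).
            cell = min(int(x * n_cells), n_cells - 1)
            exact[cell] = exact.get(cell, 0) + 1
        # Only the two children of each kept parent are scored; the rest stay pruned.
        candidates = [c for p in kept for c in (2 * p, 2 * p + 1)]
        scores = {c: exact.get(c, 0) + laplace(rng, scale) for c in candidates}
        kept = sorted(sorted(candidates, key=scores.get, reverse=True)[:k])
        levels[level] = {c: scores[c] for c in kept}
    return levels

rng = random.Random(0)
data = [rng.random() for _ in range(1000)]
hierarchy = noisy_pruned_hierarchy(data, depth=4, k=3, eps=1.0, rng=rng)
```

Pruning decisions here depend only on noisy counts, so they are post-processing of already-noised values. A larger `k` keeps more of the hierarchy at the cost of more stored cells.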
