🤖 AI Summary
To address differentially private synthetic data generation under stringent memory constraints, this paper proposes PrivHP, a lightweight algorithm. PrivHP integrates hierarchical cumulative distribution function (CDF) approximation with private frequency sketches, achieving $\varepsilon$-differential privacy using only $O(k \log^2 |X|)$ memory. Its key innovation lies in a skew-aware joint optimization mechanism for the pruning parameter $k$ and hierarchy depth, enabling the first derivation of an interpretable, explicitly decoupled Wasserstein error bound that separates noise, pruning, and estimation errors. Theoretically and empirically, this bound balances space efficiency, privacy guarantees, and data utility. Experiments demonstrate that PrivHP significantly outperforms state-of-the-art methods in synthetic data quality under low-memory regimes, while maintaining rigorous privacy compliance.
📝 Abstract
We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and pruning of low-frequency subdomains while preserving frequent ones. Private sketches estimate subdomain frequencies efficiently without full data access. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, capturing all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |X|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator with expected Wasserstein distance \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{\|\mathtt{tail}_k(\mathcal{X})\|_1}{M n}\right) \] from the empirical distribution. This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
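To make the two terms of the bound concrete, the following is a minimal Python sketch that evaluates them for a given vector of subdomain counts. It is an illustration only, not the paper's implementation: the function names `tail_k_l1` and `bound_terms` are hypothetical, the hidden constants of the $\mathcal{O}(\cdot)$ notation are dropped, the logarithm is assumed to be natural, and the memory budget $M$ is passed in directly rather than derived from $k$.

```python
import math


def tail_k_l1(counts, k):
    """L1 norm of tail_k: the sum of all counts except the k largest."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ordered = sorted(counts, reverse=True)
    return sum(ordered[k:])


def bound_terms(counts, k, epsilon, memory):
    """Return the (noise, pruning) terms of the bound, constants omitted.

    Assumptions (not from the source): natural logarithm, and `memory`
    (M) supplied by the caller instead of computed as O(k log^2 |X|).
    """
    n = sum(counts)
    if n <= 0 or epsilon <= 0 or memory <= 1:
        raise ValueError("need n > 0, epsilon > 0 and memory > 1")
    noise = math.log(memory) ** 2 / (epsilon * n)
    pruning = tail_k_l1(counts, k) / (memory * n)
    return noise, pruning


# Skewed toy counts: the two heaviest subdomains hold 80 of 100 points.
counts = [50, 30, 10, 5, 3, 2]
print(tail_k_l1(counts, 2))  # 20
noise, pruning = bound_terms(counts, k=2, epsilon=1.0, memory=64)
print(noise, pruning)
```

On skewed inputs such as this toy example, the tail mass is small, so the pruning term shrinks quickly as $k$ grows, which is the trade-off the parameter $k$ is meant to expose.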