🤖 AI Summary
This paper addresses statistical inference under unbounded differential privacy (DP), where the sample size $n$ is itself sensitive and is available only in privatized form, so methods that treat $n$ as publicly known do not directly apply. We first show that the sampling distributions under unbounded and bounded DP become asymptotically equivalent as $n$ grows, provided the noise added to $n$ scales at an appropriate rate, and that ABC-type posterior distributions converge under similar assumptions. We also study the regime where the privacy budget for $n$ goes to zero, showing that the unbounded-DP MLE converges to the bounded-DP MLE. For finite-sample Bayesian inference, we propose a reversible jump MCMC algorithm that extends the data augmentation MCMC of Ju et al. (2022); for maximum likelihood estimation, we propose a Monte Carlo EM algorithm that applies in both bounded and unbounded DP. We apply the methodology to a linear regression model and to the 2019 American Time Use Survey (ATUS) microdata file, which we model with a Dirichlet distribution.
📝 Abstract
We develop both theory and algorithms to analyze privatized data under unbounded differential privacy (DP), where even the sample size is considered a sensitive quantity that requires privacy protection. We show that the distance between the sampling distributions under unbounded DP and bounded DP goes to zero as the sample size $n$ goes to infinity, provided that the noise used to privatize $n$ scales at an appropriate rate; we also establish that ABC-type posterior distributions converge under similar assumptions. We further give asymptotic results in the regime where the privacy budget for $n$ goes to zero, establishing similarity of sampling distributions as well as showing that the MLE in the unbounded setting converges to the bounded-DP MLE. To facilitate valid, finite-sample Bayesian inference on privatized data in the unbounded DP setting, we propose a reversible jump MCMC algorithm that extends the data augmentation MCMC of Ju et al. (2022). We also propose a Monte Carlo EM algorithm to compute the MLE from privatized data in both bounded and unbounded DP. We apply our methodology to a linear regression model as well as to the 2019 American Time Use Survey Microdata File, which we model using a Dirichlet distribution.
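To make the unbounded DP setting concrete, the following is a minimal Python sketch of what an analyst might receive when the sample size is privatized alongside a summary statistic. It is an illustration under stated assumptions, not the mechanism or the inference procedure from the paper: the Laplace mechanism, the clamping of values to $[0, 1]$, the budget split `eps_n` / `eps_sum`, and the function names `laplace` and `privatize_unbounded` are all hypothetical choices made for this example.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent Exponential(1) draws is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def privatize_unbounded(data, eps_n, eps_sum, rng):
    """Release a noisy count and a noisy sum (hypothetical mechanism).

    Neighboring datasets differ by adding or removing one record, so the
    count changes by at most 1 and the sum of values clamped to [0, 1]
    also changes by at most 1. By basic composition the release costs
    eps_n + eps_sum in total.
    """
    clamped = [min(max(x, 0.0), 1.0) for x in data]
    noisy_n = len(clamped) + laplace(1.0 / eps_n, rng)
    noisy_sum = sum(clamped) + laplace(1.0 / eps_sum, rng)
    return noisy_n, noisy_sum


rng = random.Random(0)
data = [rng.random() for _ in range(1000)]  # synthetic confidential data
noisy_n, noisy_sum = privatize_unbounded(data, eps_n=0.5, eps_sum=0.5, rng=rng)

# Naive plug-in estimate of the mean; it ignores the noise in both terms.
plug_in_mean = noisy_sum / noisy_n
print(noisy_n, noisy_sum, plug_in_mean)
```

The analyst sees only `noisy_n` and `noisy_sum`, never `len(data)`. The plug-in ratio treats the noisy count as if it were exact, which is the kind of shortcut that likelihood-based approaches such as the reversible jump MCMC and Monte Carlo EM algorithms described above are meant to avoid by accounting for the uncertainty in $n$.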