When Do Fewer Coordinates Suffice in DP-SGD?

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of differentially private stochastic gradient descent (DP-SGD): because noise is added to every updated coordinate, the injected noise energy scales with the parameter dimension $d$, which degrades utility. The authors propose TP-TopK (Two-Phase TopK DP-SGD), a two-phase method that requires no public data: a private warm-up phase identifies a support set of important coordinates, and the main training phase uses that support to restrict updates, so the relevant noise term scales with the active dimension $k$ rather than with $d$. The theoretical analysis gives a criterion for when coordinate restriction can be beneficial, a nonconvex stationarity bound that holds under that criterion, and a lower bound on the reliability of warm-up-based coordinate ranking. Experiments on MNIST, FMNIST, and CIFAR-10 show that learned supports can retain more gradient energy than random supports of the same size, with the largest gains when $k$ is small and the warm-up scores are informative.
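To make the two-phase idea concrete, the following is a minimal NumPy sketch on a toy least-squares problem. It is not the authors' algorithm: the scoring rule (accumulated magnitude of the privatized warm-up gradients), the choice to mask before clipping, full-batch updates, and all names (`clip`, `noisy_mean_grad`, `least_squares_grads`, `two_phase_topk`, `c`, `sigma`) are assumptions made for illustration, and privacy accounting is omitted entirely.

```python
import numpy as np

def clip(g, c):
    """Scale g so that its L2 norm is at most c."""
    norm = np.linalg.norm(g)
    return g * min(1.0, c / (norm + 1e-12))

def noisy_mean_grad(grads, c, sigma, rng, support=None):
    """Clipped, Gaussian-noised mean of per-example gradients.

    If `support` is given, gradients and noise are restricted to those
    coordinates, so noise is drawn for k coordinates instead of d.
    """
    n, d = grads.shape
    mask = np.ones(d)
    if support is not None:
        mask = np.zeros(d)
        mask[support] = 1.0
    # Assumption: mask first, then clip, so clipping acts on active coordinates.
    clipped = np.stack([clip(g * mask, c) for g in grads])
    noise = rng.normal(0.0, sigma * c, size=d) * mask
    return (clipped.sum(axis=0) + noise) / n

def least_squares_grads(w, X, y):
    """Per-example gradients of 0.5 * (x @ w - y)^2 (toy objective)."""
    residual = X @ w - y
    return residual[:, None] * X

def two_phase_topk(X, y, k, warmup_steps, main_steps, lr, c, sigma, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = np.zeros(d)
    scores = np.zeros(d)

    # Phase 1: dense private warm-up. Scores are computed only from the
    # already-privatized gradient, so ranking is post-processing.
    for _ in range(warmup_steps):
        g = noisy_mean_grad(least_squares_grads(w, X, y), c, sigma, rng)
        scores += np.abs(g)
        w -= lr * g

    # Select the k highest-scoring coordinates as the support.
    support = np.sort(np.argsort(scores)[-k:])

    # Phase 2: main training updates only the selected coordinates.
    for _ in range(main_steps):
        g = noisy_mean_grad(least_squares_grads(w, X, y), c, sigma, rng, support)
        w -= lr * g
    return w, support
```

In this sketch the warm-up phase still pays the full-dimensional noise cost; only the main phase benefits from the smaller active dimension, which is why the quality of the warm-up ranking matters.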

📝 Abstract

Differentially private stochastic gradient descent (DP-SGD) injects noise into every updated coordinate, making the injected noise energy scale with the ambient parameter dimension $d$. We ask when private training can update fewer coordinates without losing the signal needed for optimization. We propose TP-TopK (Two-Phase TopK DP-SGD), a two-phase method for coordinate-sparse private training without public data, in which a private warm-up phase identifies a coordinate support used to guide the main training phase. We give a criterion characterizing when coordinate restriction can be beneficial, show via a nonconvex stationarity bound that under this condition the relevant noise term scales with the active dimension $k$ rather than the full parameter dimension $d$, and provide a lower bound on the reliability of warm-up-based coordinate ranking. Experiments on MNIST, FMNIST, and CIFAR-10 show that learned coordinate supports can retain more gradient energy than size-matched random supports, with the largest gains when the active dimension is small and warm-up scores are informative.
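The two quantities the abstract compares can be written down in a few lines. The sketch below assumes that "gradient energy" means the squared L2 norm of the gradient and that the noise is isotropic Gaussian with per-coordinate standard deviation `sigma * c`; neither definition is stated in this summary, and the helper names are hypothetical.

```python
import numpy as np

def retained_energy(g, support):
    """Fraction of the squared L2 norm of g carried by `support` coordinates."""
    g = np.asarray(g, dtype=float)
    return float(np.sum(g[support] ** 2) / np.sum(g ** 2))

def topk_support(scores, k):
    """Indices of the k largest scores (order within the result is arbitrary)."""
    return np.argpartition(np.asarray(scores), -k)[-k:]

def random_support(d, k, rng):
    """Size-matched random baseline: k distinct coordinates out of d."""
    return rng.choice(d, size=k, replace=False)

def expected_noise_energy(num_active, sigma, c):
    """E||z||^2 for z ~ N(0, (sigma * c)^2 I) on `num_active` coordinates."""
    return num_active * (sigma * c) ** 2
```

Under these assumptions the noise energy drops by a factor of $d/k$ when only $k$ coordinates are active, while the retained signal depends on how much of the gradient's energy the chosen support captures. Comparing `retained_energy` for `topk_support` against `random_support` at the same $k$ mirrors the size-matched comparison described in the experiments.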

Problem

Research questions and friction points this paper is trying to address.

DP-SGD

coordinate sparsity

differential privacy

gradient energy

parameter dimension

Innovation

Methods, ideas, or system contributions that make the work stand out.

DP-SGD

coordinate sparsity

two-phase training