Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the unresolved statistical cost of machine unlearning under smooth, strongly convex losses by proposing a near-optimal algorithm for approximate ε-unlearning. It proves upper and lower bounds on the excess population risk that are tight up to a condition-number factor, and it relates the resulting rate to the accuracy of retraining from scratch and of differentially private baselines. For mean estimation over the unit ball, the upper and lower bounds match. When ε ≫ d, where d is the model dimension, the proposed method achieves exponentially better accuracy than both retraining and differentially private baselines, whereas retraining is optimal when ε ≤ d. Combining stochastic optimization, generalization error analysis, and information-theoretic lower bounds within a formal ε-unlearning framework, the study identifies a phase transition in the unlearning penalty as a function of ε/d.
📝 Abstract
Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals' data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unlearning in smooth strongly convex stochastic optimization, but the fundamental statistical cost of unlearning has remained unclear. We nearly resolve this problem by proving upper and lower bounds on the excess population risk of approximate $\varepsilon$-unlearning; our bounds are tight up to a condition-number factor. For mean estimation over the unit ball, our upper and lower bounds match. The optimal rate is the usual statistical error plus an unlearning penalty that interpolates between the retraining-from-scratch rate and an exponentially smaller term as $\varepsilon/d$ grows, where $d$ is the dimension of the model. In particular, when $\varepsilon \gg d$, our $\varepsilon$-unlearning algorithm offers an exponential accuracy improvement over retraining the model from scratch and differentially private baselines. On the other hand, when $\varepsilon \le d$, retraining from scratch is optimal.
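To make the retraining-from-scratch baseline concrete, the following is a minimal sketch for mean estimation over the unit ball. It is not the paper's ε-unlearning algorithm, whose details are not given in the abstract. The function names, the synthetic data, and the choice of which points to forget are all illustrative assumptions. The sketch only shows that, for this task, retraining on the retained points yields the same estimate as subtracting the forgotten points from the running sum.

```python
import numpy as np

def project_unit_ball(w):
    """Project w onto the Euclidean unit ball (hypothetical helper)."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def retrain_mean(data, forget_idx):
    """Baseline: recompute the mean from scratch on the retained points."""
    keep = np.ones(len(data), dtype=bool)
    keep[forget_idx] = False
    return project_unit_ball(data[keep].mean(axis=0))

def downdate_mean(full_mean, data, forget_idx):
    """Same estimate without a full pass: remove the forgotten points' sum.

    Assumes forget_idx holds distinct indices and full_mean is the
    unprojected empirical mean of all rows of data.
    """
    n, m = len(data), len(forget_idx)
    removed_sum = data[forget_idx].sum(axis=0)
    return project_unit_ball((n * full_mean - removed_sum) / (n - m))

# Synthetic data, rescaled so that every point lies in the unit ball.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 5))
data = raw / np.maximum(1.0, np.linalg.norm(raw, axis=1, keepdims=True))

forget_idx = np.array([3, 17, 256])   # arbitrary points to forget
full_mean = data.mean(axis=0)

w_retrain = retrain_mean(data, forget_idx)
w_downdate = downdate_mean(full_mean, data, forget_idx)
```

Both estimates agree up to floating-point error, and the projection is a no-op here because a mean of points in the unit ball stays in the unit ball. According to the abstract, this retraining rate is optimal when $\varepsilon \le d$, while the paper's algorithm improves on it exponentially when $\varepsilon \gg d$.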
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
excess population risk
smooth strongly convex losses
statistical cost
right to be forgotten
Innovation

Methods, ideas, or system contributions that make the work stand out.

machine unlearning
strongly convex optimization
excess population risk
statistical lower bounds
privacy-accuracy tradeoff