Breaking the Gold Standard: Extracting Forgotten Data under Exact Unlearning in Large Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reveals that “exact unlearning”—such as retraining from scratch to remove data—in large language models (LLMs) fails to guarantee privacy and may even amplify leakage risks. To demonstrate this, we propose a novel data extraction attack that leverages discrepancy signals between the pre- and post-unlearning model checkpoints to guide adversarial generation, combined with context-aware dynamic token filtering to recover “deleted” sensitive information with high fidelity. We provide the first theoretical argument that exact unlearning can exacerbate privacy vulnerabilities and introduce the first attack paradigm integrating cross-checkpoint model guidance with adaptive token filtering. Evaluated on benchmarks including MUSE, TOFU, and WMDP, as well as synthetic medical data, our method achieves up to a 100% improvement in extraction success rate and robustly reconstructs forgotten diagnostic patterns—demonstrating severe practical privacy threats in real-world deployment scenarios.

📝 Abstract
Large language models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard, believed to be robust against privacy-related attacks. In this paper, we challenge this assumption by introducing a novel data extraction attack that compromises even exact unlearning. Our method leverages both the pre- and post-unlearning models: by guiding the post-unlearning model using signals from the pre-unlearning model, we uncover patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may paradoxically increase the risk of privacy leakage, we advocate that evaluations of unlearning methods consider broader threat models accounting not only for post-unlearning models but also for adversarial access to prior checkpoints.
Problem

Research questions and friction points this paper is trying to address.

Exact unlearning fails to prevent sensitive data extraction
Novel attack exploits pre- and post-unlearning model differences
Unlearning methods may inadvertently increase privacy leakage risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre- and post-unlearning model signals
Uses model guidance with token filtering
Demonstrates attack on medical diagnosis dataset
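The core intuition behind the attack can be sketched as a guided decoding step. The following is an illustrative simplification, not the paper's exact algorithm: the helper name `guided_next_token`, the guidance weight `alpha`, and the filtering threshold `tau` are all assumptions for exposition. The idea is that tokens whose probability dropped between the pre- and post-unlearning checkpoints are exactly those tied to the removed data, so the attacker amplifies that difference while filtering out tokens the pre-unlearning model itself finds implausible.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def guided_next_token(logits_pre, logits_post, alpha=1.0, tau=0.01):
    """Illustrative sketch of cross-checkpoint guided decoding.

    logits_pre  -- next-token logits from the pre-unlearning checkpoint
    logits_post -- next-token logits from the post-unlearning checkpoint
    alpha       -- guidance strength (hypothetical knob, not from the paper)
    tau         -- token-filtering probability threshold (also hypothetical)
    """
    # Guidance: steer the post-unlearning model toward the pre-unlearning
    # distribution. The difference term up-weights tokens whose probability
    # fell after unlearning -- i.e., tokens associated with the removed data.
    scores = logits_post + alpha * (logits_pre - logits_post)

    # Token filtering: discard tokens the pre-unlearning model assigns
    # negligible probability, to keep generations fluent and on-distribution.
    mask = softmax(logits_pre) >= tau
    scores = np.where(mask, scores, -np.inf)
    return int(np.argmax(scores))


# Toy vocabulary of 3 tokens; token 2 was associated with the forgotten data:
# the pre-unlearning model favors it, the post-unlearning model suppresses it.
logits_pre = np.array([1.0, 1.0, 4.0])
logits_post = np.array([2.0, 2.0, 0.5])

print(guided_next_token(logits_pre, logits_post, alpha=1.0))  # guided: picks 2
print(guided_next_token(logits_pre, logits_post, alpha=0.0))  # unguided: picks 0
```

With `alpha=0` the attacker decodes from the post-unlearning model alone and the suppressed token is never recovered; with guidance enabled, the cross-checkpoint difference surfaces it again, which is the leakage mechanism the paper exploits.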
🔎 Similar Papers
2024-06-22 · International Conference on Computational Linguistics · Citations: 4