Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine unlearning methods often merely suppress outputs without truly erasing knowledge, leaving them vulnerable to simple prompt-based attacks. Method: The authors propose the first prompt-attack-based evaluation paradigm for unlearning efficacy, systematically assessing eight state-of-the-art unlearning techniques across three model architectures. The evaluation is multi-dimensional: output accuracy analysis, logit-space correlation testing, probe-based knowledge detection, and cross-lingual prompt perturbations (e.g., Hindi filler text). Contribution/Results: Experiments confirm the robustness of RMU and TAR, whereas ELM exhibits significant knowledge recovery, with 57.3% of accuracy restored under Hindi-filler prompts, demonstrating persistent memorization. A strong correlation between logit-space similarity and output accuracy indicates that unlearned models are not merely reformatting answers to hide residual knowledge. The authors release an open-source, reproducible evaluation framework and advocate a critical distinction between “output suppression” and genuine “knowledge removal” to establish more rigorous, reliable standards for unlearning assessment.
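The cross-lingual filler attack described above can be sketched as a simple prompt transformation followed by an accuracy measurement over the forget set. Everything below (function names, the stub model, the QA pairs) is illustrative, not taken from the paper:

```python
def apply_filler_attack(question: str, filler: str) -> str:
    """Prepend distractor filler text (e.g., a Hindi passage) to the question."""
    return f"{filler}\n\n{question}"

def recovered_accuracy(answer_fn, qa_pairs, filler: str) -> float:
    """Fraction of supposedly unlearned answers recovered under the attack.

    answer_fn: callable mapping a prompt string to the model's answer string.
    qa_pairs:  list of (question, gold_answer) tuples from the forget set.
    """
    correct = 0
    for question, gold in qa_pairs:
        answer = answer_fn(apply_filler_attack(question, filler))
        correct += int(answer.strip().lower() == gold.strip().lower())
    return correct / len(qa_pairs)
```

A genuinely unlearned model should score near zero both with and without the filler; a large gap between the two conditions (as reported for ELM) signals output suppression rather than knowledge removal.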

📝 Abstract
In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, and employ output-based, logit-based, and probe analyses to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in the original prompt recovers 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by modifying the way the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between true knowledge removal and superficial output suppression. We also make our evaluation framework publicly available to ease the evaluation of prompting techniques for retrieving supposedly unlearned knowledge.
Problem

Research questions and friction points this paper is trying to address.

Evaluate unlearning methods' vulnerability to prompt attacks
Assess knowledge retrieval in supposedly unlearned models
Challenge assumptions about true knowledge removal effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluate eight unlearning techniques
Employ output-based, logit-based, and probe analyses
Publicly available evaluation framework