🤖 AI Summary
Machine unlearning methods often merely suppress outputs without truly erasing knowledge, leaving them vulnerable to simple prompt-based attacks. Method: The authors propose the first prompt-attack-based evaluation paradigm for unlearning efficacy, systematically assessing eight state-of-the-art unlearning techniques across three model architectures. They introduce multi-dimensional validation: output accuracy analysis, logit-space correlation testing, probe-based knowledge detection, and cross-lingual prompt perturbations (e.g., Hindi fillers). Contribution/Results: Experiments confirm the robustness of RMU and TAR, whereas ELM exhibits significant knowledge recovery—57.3% accuracy restoration under Hindi prompts—demonstrating persistent memorization. A strong correlation between logit-space similarity and output accuracy indicates that recovered knowledge reflects residual retention in the model's internal representations rather than superficial answer reformatting. The authors release an open-source, reproducible evaluation framework and advocate a critical distinction between “output suppression” and genuine “knowledge removal” to establish more rigorous, reliable standards for unlearning assessment.
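The cross-lingual perturbation idea can be sketched in a few lines: wrap the original benchmark question in unrelated Hindi filler text and measure how much of the pre-unlearning accuracy the attacked model recovers. This is a hypothetical illustration, not the paper's released code; the filler string, function names, and the `recovery_rate` formula are assumptions.

```python
# Illustrative sketch of a Hindi-filler prompt perturbation (hypothetical;
# the actual attack prompts used in the paper may differ).

HINDI_FILLER = "यह एक सामान्य वाक्य है।"  # an innocuous Hindi filler sentence

def perturb_prompt(question: str, filler: str = HINDI_FILLER, n: int = 3) -> str:
    """Surround the original question with n repeated filler sentences."""
    pad = " ".join([filler] * n)
    return f"{pad}\n{question}\n{pad}"

def recovery_rate(unlearned_acc: float, attacked_acc: float, original_acc: float) -> float:
    """Fraction of the accuracy lost to unlearning that the attack restores."""
    if original_acc == unlearned_acc:
        return 0.0
    return (attacked_acc - unlearned_acc) / (original_acc - unlearned_acc)
```

In an actual evaluation, `perturb_prompt` would be applied to each benchmark question before querying the unlearned model, and `recovery_rate` would summarize how much suppressed capability resurfaces.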
📝 Abstract
In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, employing output-based, logit-based, and probe-based analyses to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., adding Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by merely changing how answers are formatted, as the correlation between output accuracy and logit-based accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish true knowledge removal from superficial output suppression. We also make our evaluation framework publicly available to ease the evaluation of prompting techniques for retrieving unlearned knowledge.
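The logit-vs-output correlation check described above can be sketched as follows: if a model were hiding knowledge behind answer formatting, per-question logit-based scores and decoded-output correctness would decouple, so a strong Pearson correlation between the two supports genuine agreement between internal state and output. The data and variable names below are synthetic and purely illustrative.

```python
# Minimal sketch of correlating logit-based scores with output correctness
# (synthetic data; not the paper's actual measurements).
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-question scores: probability mass assigned to the correct
# answer option (logit-based) vs. whether the decoded output was correct.
logit_scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
output_correct = [1, 1, 0, 0, 1, 0]
r = pearson(logit_scores, output_correct)  # close to 1 when the two agree
```

A value of `r` near 1 across an unlearning benchmark would indicate that suppressed or recovered outputs track the model's internal answer distribution rather than a formatting artifact.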