Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine unlearning methods often merely suppress outputs without truly erasing knowledge, leaving them vulnerable to simple prompt-based attacks. Method: The authors propose the first prompt-attack-based evaluation paradigm for unlearning efficacy, systematically assessing eight state-of-the-art unlearning techniques across three model architectures. The evaluation is multi-dimensional: output accuracy analysis, logit-space correlation testing, probe-based knowledge detection, and cross-lingual prompt perturbations (e.g., Hindi filler text). Contribution/Results: Experiments confirm the robustness of RMU and TAR, whereas ELM exhibits significant knowledge recovery, with 57.3% of accuracy restored under Hindi-filler prompts, demonstrating persistent memorization. A strong correlation between logit-space similarity and output accuracy indicates that unlearned models are not merely reformatting answers to hide residual knowledge. The authors release an open-source, reproducible evaluation framework and advocate a critical distinction between “output suppression” and genuine “knowledge removal” to establish more rigorous, reliable standards for unlearning assessment.
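The cross-lingual filler attack described above can be sketched as a simple prompt transformation followed by an accuracy measurement over the forget set. Everything below (function names, the stub model, the QA pairs) is illustrative, not taken from the paper:

```python
def apply_filler_attack(question: str, filler: str) -> str:
    """Prepend distractor filler text (e.g., a Hindi passage) to the question."""
    return f"{filler}\n\n{question}"

def recovered_accuracy(answer_fn, qa_pairs, filler: str) -> float:
    """Fraction of supposedly unlearned answers recovered under the attack.

    answer_fn: callable mapping a prompt string to the model's answer string.
    qa_pairs:  list of (question, gold_answer) tuples from the forget set.
    """
    correct = 0
    for question, gold in qa_pairs:
        answer = answer_fn(apply_filler_attack(question, filler))
        correct += int(answer.strip().lower() == gold.strip().lower())
    return correct / len(qa_pairs)
```

A genuinely unlearned model should score near zero both with and without the filler; a large gap between the two conditions (as reported for ELM) signals output suppression rather than knowledge removal.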

📝 Abstract
In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, and employ output-based, logit-based, and probe analyses to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in the original prompt recovers 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by modifying the way the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between true knowledge removal and superficial output suppression. We also make our evaluation framework publicly available to ease the evaluation of prompting techniques for retrieving supposedly unlearned knowledge.
Problem

Research questions and friction points this paper is trying to address.

Evaluate unlearning methods' vulnerability to prompt attacks
Assess knowledge retrieval in supposedly unlearned models
Challenge assumptions about true knowledge removal effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluate eight unlearning techniques
Employ output-based, logit-based, and probe analyses
Publicly available evaluation framework