Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work asks whether existing large language model (LLM) unlearning methods genuinely erase harmful knowledge. Using a black-box evaluation, the authors assess two mainstream approaches, LLM Unlearning (LLMU) and Representation Misdirection for Unlearning (RMU), on the WMDP benchmark and a newly constructed biology benchmark. The results indicate that current methods fail to achieve true unlearning: simple question rephrasing or 5-shot prompting can raise accuracy on unlearning benchmarks more than ten-fold, and fine-tuning on unrelated data almost completely restores pre-unlearning performance. RMU nonetheless preserves general model capabilities better than LLMU, for similar or better unlearning. These findings challenge prevailing assumptions about the effectiveness of unlearning methods and provide empirical evidence relevant to trustworthy AI governance.

๐Ÿ“ Abstract
Large language model unlearning aims to remove harmful information that LLMs have learnt to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study in detail the efficacy of these methods by evaluating their impact on general model capabilities on the WMDP benchmark as well as a biology benchmark we create. Our experiments show that RMU generally leads to better preservation of model capabilities, for similar or better unlearning. We further test the robustness of these methods and find that doing 5-shot prompting or rephrasing the question in simple ways can lead to an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail at truly unlearning. The code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
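The robustness probe the abstract describes can be sketched as wrapping a benchmark question in a k-shot prompt before querying the supposedly unlearned model. The sketch below is illustrative only: the names (`Example`, `build_k_shot_prompt`) are assumptions, not taken from the paper's repository.

```python
# Hypothetical sketch of the black-box robustness probe: format a
# multiple-choice question with k solved demonstrations prepended,
# then compare the model's accuracy against the zero-shot query.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    choices: list[str]   # e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    answer: str          # letter of the correct choice

def format_mcq(ex: Example) -> str:
    # Render one question in the usual MCQ layout, ending with "Answer:".
    return "\n".join([ex.question] + ex.choices + ["Answer:"])

def build_k_shot_prompt(shots: list[Example], target: Example) -> str:
    """Prepend k solved examples; the paper reports that this extra
    context alone can raise accuracy on 'unlearned' items over ten-fold."""
    demos = [format_mcq(s) + " " + s.answer for s in shots]
    return "\n\n".join(demos + [format_mcq(target)])
```

Scoring then reduces to checking whether the model's next-token answer letter matches the gold label, with and without the demonstrations.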
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM unlearning methods' effectiveness
Assess impact on general model capabilities
Test robustness against prompting strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box evaluation of LLMU and RMU on WMDP and a new biology benchmark
Tests robustness with 5-shot prompting and question rephrasing
Demonstrates fine-tuning on unrelated data recovers pre-unlearning performance
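The simplest of the probes above, re-querying with a rephrased question, amounts to measuring how often "erased" answers come back under a paraphrase. A minimal sketch, assuming a generic black-box `query_model` callable (not from the paper's code):

```python
# Hypothetical recovery-rate measurement: rephrase each question on the
# unlearning benchmark, re-query the model, and count correct answers.
# query_model and rephrase are stand-ins for any LLM call / paraphraser.
def recovery_rate(questions, answers, rephrase, query_model):
    """Fraction of supposedly unlearned items answered correctly
    after a simple rephrasing of the question."""
    hits = sum(
        1
        for q, gold in zip(questions, answers)
        if query_model(rephrase(q)) == gold
    )
    return hits / len(questions)
```

A large gap between this rate and the zero-shot accuracy on the original phrasings is the paper's evidence that the knowledge was suppressed rather than erased.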