AI Summary
This work addresses the fundamental question of whether existing large language model (LLM) unlearning methods genuinely erase harmful knowledge. Adopting a black-box evaluation paradigm, we systematically assess two mainstream unlearning approaches, LLM Unlearning (LLMU) and Representation Misdirection for Unlearning (RMU), on the WMDP benchmark and a newly constructed biology benchmark. Our results reveal that current methods fail to achieve true unlearning: over 90% of purportedly erased knowledge can be recovered via simple prompt rewriting or fine-tuning on unrelated data, raising accuracy on unlearning benchmarks by more than an order of magnitude. RMU substantially outperforms LLMU in preserving general capabilities. Robustness evaluations across multiple benchmarks and under 5-shot prompting attacks further expose severe vulnerabilities. By targeting the substantive efficacy of unlearning rather than benchmark scores alone, this first black-box evaluation challenges prevailing assumptions about method effectiveness and provides critical empirical evidence for trustworthy AI governance.
Abstract
Large language model unlearning aims to remove harmful information that LLMs have learned, to prevent their use for malicious purposes. LLMU and RMU have been proposed as two methods for LLM unlearning, achieving impressive results on unlearning benchmarks. We study the efficacy of these methods in detail by evaluating their impact on general model capabilities on the WMDP benchmark as well as a biology benchmark we create. Our experiments show that RMU generally leads to better preservation of model capabilities, for similar or better unlearning. We further test the robustness of these methods and find that using 5-shot prompting or rephrasing the question in simple ways can yield an over ten-fold increase in accuracy on unlearning benchmarks. Finally, we show that training on unrelated data can almost completely recover pre-unlearning performance, demonstrating that these methods fail to truly unlearn. The code is available at: https://github.com/JaiDoshi/Knowledge-Erasure.
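The rephrasing and 5-shot probes described above can be sketched as a small black-box harness. This is a minimal illustration, not the paper's exact evaluation code: `query_model` is a hypothetical stand-in for whatever API serves the unlearned model, and the paraphrase templates are illustrative surface rewrites rather than the prompts used in the experiments.

```python
# Black-box probe sketch: wrap a WMDP-style multiple-choice question in
# paraphrased and 5-shot variants, then measure accuracy under each prompt.
# `query_model` is a hypothetical callable: prompt string -> answer string.

def format_mcq(question, choices):
    """Render a question with lettered options and an answer cue."""
    opts = "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices))
    return f"{question}\n{opts}\nAnswer:"

def rephrased_variants(question, choices):
    """Simple surface rewrites that leave the correct answer unchanged."""
    return [
        format_mcq(question, choices),                             # plain
        format_mcq(f"Answer the following: {question}", choices),  # prefixed
        format_mcq(question.lower(), choices),                     # lowercased
    ]

def five_shot_prompt(shots, question, choices):
    """Prepend five answered demonstrations before the target question."""
    demos = "\n\n".join(
        format_mcq(q, c) + f" {a}" for q, c, a in shots
    )
    return demos + "\n\n" + format_mcq(question, choices)

def probe_accuracy(query_model, items, prompt_fn):
    """Fraction of (question, choices, answer) items answered correctly
    when each question is rendered with prompt_fn."""
    correct = 0
    for question, choices, answer in items:
        pred = query_model(prompt_fn(question, choices)).strip()[:1]
        correct += (pred == answer)
    return correct / len(items)
```

Comparing `probe_accuracy` under the plain prompt against the rephrased and 5-shot variants gives the kind of accuracy gap the paper reports: a genuinely unlearned model should stay near chance under every variant, while a model that merely suppresses its knowledge recovers it under trivial prompt changes.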