🤖 AI Summary
This work addresses an oversight in existing knowledge erasure methods: they frequently neglect the embedding layer, which leaves erased knowledge vulnerable to recovery via adversarial prompts or relearning. The study highlights the role of the embedding layer in effective knowledge removal and introduces EMBER, a plug-and-play, embedding-level intervention module. EMBER uses sparse matrix factorization to identify and edit the token embeddings associated with target concepts. It integrates with existing parameter-update-based erasure techniques and improves both erasure efficacy and specificity on Gemma-2-2B-it and Llama-3.1-8B-Instruct. Accuracy regained through relearning is reduced by up to 50% and is limited to 35% on Llama, compared to 70%-76% for prior methods. The coherence cost is localized to a small set of concept-exclusive tokens.
📝 Abstract
As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge can often be recovered through adversarial prompting or relearning. In this work, we hypothesize that this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50% and limiting it to 35% on Llama, compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.
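To make the idea of erasing a concept-related feature from token embeddings concrete, here is a toy sketch in pure Python. It is not the EMBER implementation: the abstract does not say how the factorization is computed or how features are edited. The sketch assumes a dictionary of orthonormal feature directions is already available (here it is simply axis-aligned), that each embedding is a sparse combination of those directions, and that erasure means subtracting one feature's contribution. All names (`erase_feature`, `DICTIONARY`, `CODES`) are hypothetical.

```python
# Toy illustration only; hypothetical names, not the authors' method.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def erase_feature(embeddings, dictionary, feature_idx, threshold=1e-6):
    """Subtract one dictionary feature from every embedding that uses it.

    Assumption: dictionary rows are orthonormal, so a token's code for a
    feature is its dot product with that row.
    Returns the edited embeddings and the ids of the tokens that changed.
    """
    direction = dictionary[feature_idx]
    edited, touched = [], []
    for token_id, vec in enumerate(embeddings):
        code = dot(vec, direction)
        if abs(code) > threshold:
            # Token uses the concept feature: remove its contribution.
            vec = [x - code * d for x, d in zip(vec, direction)]
            touched.append(token_id)
        else:
            # Token does not use the feature: leave it unchanged.
            vec = list(vec)
        edited.append(vec)
    return edited, touched


# Three orthonormal feature directions in a 4-dimensional embedding space.
DICTIONARY = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],  # feature 1 plays the role of the target concept
    [0.0, 0.0, 1.0, 0.0],
]

# Sparse codes: each token activates only a few features.
CODES = [
    [2.0, 0.0, 0.0],  # token 0: unrelated to the concept
    [0.0, 1.5, 0.0],  # token 1: concept-exclusive
    [0.5, 3.0, 0.0],  # token 2: mixes the concept with another feature
    [0.0, 0.0, 1.0],  # token 3: unrelated to the concept
]

# Embedding matrix = codes x dictionary.
EMBEDDINGS = [
    [sum(c * DICTIONARY[k][j] for k, c in enumerate(code)) for j in range(4)]
    for code in CODES
]

EDITED, TOUCHED = erase_feature(EMBEDDINGS, DICTIONARY, feature_idx=1)
```

In this toy setup only tokens 1 and 2 are modified. The concept-exclusive token 1 loses its entire embedding, while token 2 keeps its non-concept component. This loosely mirrors the abstract's observation that the coherence cost falls on a small set of concept-exclusive tokens.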