AI Summary
Privacy leakage and copyright infringement risks arise from large language models' (LLMs) memorization of training data. To address this, we propose an activation-space-directed intervention method for memory suppression that operates during inference without requiring model retraining, enabling controllable attenuation of specific memorized content. Our work is the first to systematically validate the efficacy of activation steering for memory suppression, revealing a tunable trade-off between suppression strength and linguistic fluency. Empirical evaluation on the Gemma architecture demonstrates a substantial reduction in memory regeneration rate, with less than 2% degradation in overall performance and seamless plug-and-play deployment. We further introduce a controlled literary memory benchmark to rigorously assess memory suppression capabilities. This study establishes a lightweight, efficient, and readily generalizable paradigm for enhancing LLM privacy, offering a practical alternative to costly retraining or fine-tuning approaches.
Abstract
The memorization of training data by Large Language Models (LLMs) poses significant risks, including privacy leaks and the regurgitation of copyrighted content. Activation steering, a technique that directly intervenes in model activations, has emerged as a promising approach for manipulating LLM behavior. In this work, we explore the effectiveness of activation steering in reducing memorization while preserving generalization capabilities. We conduct empirical evaluations using a controlled memorization benchmark of literary material and demonstrate that our method successfully suppresses memorized content in Gemma with minimal degradation in model performance. Additionally, we analyze the trade-offs between suppression effectiveness and linguistic fluency, highlighting the advantages and limitations of activation-based interventions. Our findings contribute to ongoing efforts in developing safer and more privacy-preserving LLMs by providing a practical and efficient mechanism to mitigate unintended memorization.
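The core mechanism described above, intervening on activations at inference time without retraining, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny stand-in layer, the `steering_vector` (in practice it would be estimated from activations on memorized versus non-memorized text), and the strength parameter `alpha` are all illustrative assumptions.

```python
# Sketch of activation steering for memorization suppression: a forward
# hook edits a layer's hidden states at inference time, subtracting the
# component along a hypothetical "memorization direction".
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyBlock(nn.Module):
    """Stand-in for one transformer layer (illustrative, not Gemma)."""
    def __init__(self, d_model: int = 8):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.linear(x))

block = TinyBlock()

# Hypothetical unit-norm steering direction.
steering_vector = torch.randn(8)
steering_vector = steering_vector / steering_vector.norm()

# alpha tunes the suppression/fluency trade-off; alpha = 1.0 fully
# projects out the direction, larger values oversuppress.
alpha = 1.0

def suppress_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    coeff = output @ steering_vector          # component per example
    return output - alpha * coeff.unsqueeze(-1) * steering_vector

handle = block.register_forward_hook(suppress_hook)  # plug in
x = torch.randn(2, 8)
steered = block(x)
handle.remove()                                      # plug out
plain = block(x)

# With alpha = 1.0, the steered activations have (near-)zero component
# along the steering direction, while the rest of the state is untouched.
print((steered @ steering_vector).abs().max())
```

Because the hook can be attached and removed at will, this matches the plug-and-play deployment described above: no weights change, and the strength `alpha` can be tuned per request.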