🤖 AI Summary
This work addresses the problem of unlearning specific undesirable knowledge in large language models while preserving their other capabilities. The authors propose NSRU, a projection-constrained low-rank adaptation framework that brings null-space constraints from a retain subspace into response-specified unlearning. Each forget query is supervised with a structured safe target response, and LoRA updates are restricted to the null space of a retain subspace estimated per module from benign hidden representations, which limits interference between the forgetting and retention objectives. A joint loss combines safe-target learning, undesired-response suppression, and retention preservation. On TOFU, the method suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, it keeps hazardous-domain accuracy near random choice while preserving broad and domain-adjacent MMLU utility.
📝 Abstract
Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces Null-Space Constrained Response-Specified Unlearning (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.
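The null-space idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the SVD-based basis estimate, the input-side projection `B @ A @ P`, the function names, and the synthetic activations are all assumptions made for illustration. The sketch shows only the linear-algebra intuition that a low-rank update multiplied by the projector onto the orthogonal complement of a retain subspace leaves retain-side outputs (to first order) unchanged while still moving other inputs.

```python
import numpy as np


def retain_basis(hidden: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k directions of
    retain activations `hidden` (shape n x d). Assumed estimator: SVD."""
    _, _, vt = np.linalg.svd(hidden, full_matrices=False)
    return vt[:k].T


def null_space_projector(basis: np.ndarray) -> np.ndarray:
    """Orthogonal projector onto the complement of span(basis)."""
    d = basis.shape[0]
    return np.eye(d) - basis @ basis.T


def projected_lora_delta(B: np.ndarray, A: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Low-rank update B @ A restricted on the input side by projector P."""
    return B @ (A @ P)


rng = np.random.default_rng(0)
d_in, d_out, rank, k = 16, 8, 4, 5

# Synthetic "retain" activations that lie exactly in a k-dimensional subspace.
retain_h = rng.normal(size=(200, k)) @ rng.normal(size=(k, d_in))

U = retain_basis(retain_h, k)
P = null_space_projector(U)

# Hypothetical LoRA factors (random here; trained in practice).
B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))
delta = projected_lora_delta(B, A, P)

# Output change caused by the update on retain vs. generic "forget" inputs.
retain_shift = np.abs(retain_h @ delta.T).max()
forget_h = rng.normal(size=(10, d_in))
forget_shift = np.abs(forget_h @ delta.T).max()

print(f"max retain-side shift: {retain_shift:.2e}")
print(f"max forget-side shift: {forget_shift:.2e}")
```

In this toy setting the retain activations lie exactly in the estimated subspace, so the retain-side shift vanishes up to floating-point error. With real hidden representations the subspace is only approximate, which matches the abstract's claim that the projection reduces retain-side perturbations rather than removing them entirely.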