🤖 AI Summary
LLMs face a fundamental dilemma in unlearning: target knowledge is difficult to erase completely (it is often recoverable via adversarial prompts), yet aggressive unlearning methods frequently induce catastrophic forgetting, degrading general capabilities. Existing approaches lack interpretable analysis of how knowledge evolves during unlearning, leading to inaccurate efficacy evaluation. This paper proposes unPact, the first framework to integrate prompt-level attribution with contribution tracking, systematically dissecting the evolving roles of key tokens and the mechanisms underlying performance degradation during unlearning. Through extensive comparative experiments across multiple methods and models, the authors quantitatively demonstrate that most "forgetting" is merely superficial suppression, readily reversible with minimal prompt perturbations, while excessive intervention directly disrupts the structural integrity of knowledge representations. unPact establishes the first mechanistically interpretable evaluation paradigm for unlearning, rigorously characterizing the inherent trade-off between unlearning effectiveness and model robustness.
📝 Abstract
Unlearning seeks to remove specific knowledge from large language models (LLMs), but its effectiveness remains contested. On one side, "forgotten" knowledge can often be recovered through interventions such as light fine-tuning; on the other, unlearning may induce catastrophic forgetting that degrades general capabilities. Despite active exploration of unlearning methods, interpretability analyses of the underlying mechanism are scarce, owing to the difficulty of tracing knowledge through LLMs' complex architectures. We address this gap by proposing unPact, an interpretable framework for unlearning via prompt attribution and contribution tracking. Specifically, it quantifies each prompt token's influence on outputs, enabling pre- and post-unlearning comparisons that reveal what actually changes. Across six mainstream unlearning methods, three LLMs, and three benchmarks, we find that: (1) unlearning appears effective because it disrupts the model's focus on keywords in the prompt; (2) much of the knowledge is not truly erased and can be recovered simply by emphasizing these keywords in the prompt, without modifying the model's weights; and (3) catastrophic forgetting arises from indiscriminate penalization of all tokens. Taken together, our results suggest an unlearning dilemma: existing methods tend to be either insufficient (knowledge remains recoverable through keyword emphasis) or overly destructive (general performance collapses due to catastrophic forgetting), leaving a gap to reliable unlearning.
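The idea of quantifying each prompt token's influence on the output can be sketched with a simple occlusion-style attribution: mask one token at a time and measure how much the model's score for the target answer drops. The sketch below is a hypothetical illustration of this general technique, not the paper's actual unPact implementation; `answer_score` is an assumed stand-in for a real LLM's probability of producing the target answer.

```python
# Occlusion-style prompt attribution: a minimal, hypothetical sketch.
# `answer_score` is a toy stand-in for an LLM's probability of the
# target answer; in practice it would query an actual model.

def answer_score(prompt_tokens):
    # Toy "model": high score only when both key tokens are present.
    if "capital" in prompt_tokens and "France" in prompt_tokens:
        return 1.0
    return 0.1

def token_attributions(prompt_tokens, score_fn):
    """Attribution of each token = score drop when that token is masked."""
    base = score_fn(prompt_tokens)
    attributions = {}
    for i, tok in enumerate(prompt_tokens):
        occluded = prompt_tokens[:i] + prompt_tokens[i + 1:]
        attributions[tok] = base - score_fn(occluded)
    return attributions

prompt = ["What", "is", "the", "capital", "of", "France", "?"]
attrib = token_attributions(prompt, answer_score)
# In this toy setup, "capital" and "France" carry nearly all the attribution.
```

Comparing such per-token attribution maps before and after unlearning would reveal whether the keywords lost influence (focus disrupted) rather than whether the knowledge itself was erased, which is the kind of pre/post comparison the abstract describes.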