🤖 AI Summary
This work addresses the privacy risks posed by large language models that unintentionally memorize sensitive training data, a challenge compounded by the inability of existing unlearning methods to ensure complete removal of such information while simultaneously preserving model utility. To this end, the authors propose AGT$^{AO}$, a unified framework that formulates unlearning as a min-max game in latent space. The approach employs adversarial gating training to resist internal reconstruction attempts and introduces an adaptive orthogonality mechanism to dynamically mitigate gradient conflicts between the forgetting and retention objectives; a curriculum-based gating strategy further enhances robustness. Experimental results demonstrate that AGT$^{AO}$ achieves highly effective unlearning (KUR ≈ 0.01) while maintaining strong model performance (MMLU score of 58.30), significantly outperforming current state-of-the-art methods.
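The adaptive-orthogonality component can be pictured as conflict-aware gradient surgery. The sketch below is a minimal illustration, not the authors' released code: it uses a standard PCGrad-style projection, and the conflict-dependent strength `alpha` and the function name `adaptive_orthogonal_projection` are assumptions, since the summary does not specify the exact adaptivity rule.

```python
import torch

def adaptive_orthogonal_projection(g_forget: torch.Tensor,
                                   g_retain: torch.Tensor) -> torch.Tensor:
    """PCGrad-style projection with an assumed adaptive strength.

    When the flattened forget/retain gradients conflict (negative inner
    product), the component of g_forget along g_retain is removed, scaled
    by how severe the geometric conflict is.
    """
    dot = torch.dot(g_forget, g_retain)
    if dot >= 0:
        return g_forget  # no geometric conflict: leave the gradient untouched
    cos = dot / (g_forget.norm() * g_retain.norm() + 1e-12)
    alpha = (-cos).clamp(max=1.0)  # stronger conflict -> stronger projection (assumed rule)
    proj = dot / (g_retain.norm() ** 2 + 1e-12) * g_retain
    return g_forget - alpha * proj
```

In practice such a projection would be applied to the flattened gradients of each parameter group, and the corrected forgetting gradient would then be combined with the retention gradient for the optimizer step.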
📝 Abstract
While Large Language Models (LLMs) have achieved remarkable capabilities, they can unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.
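To make the latent-space min-max formulation concrete, here is a toy sketch under stated assumptions: every module name, loss, and the weight `lam` below are hypothetical illustrations, not the paper's architecture or objectives (those are in the linked repository). An internal adversary is trained to recover the forgotten target from gated latent representations, while the gated encoder is trained to preserve retained behavior and to maximize the adversary's reconstruction error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEncoder(nn.Module):
    """Toy feature extractor whose learned gate can suppress forget-related
    latent directions (hypothetical stand-in, not the paper's model)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return h * torch.sigmoid(self.gate(h))  # gated latent representation

class RecoveryAdversary(nn.Module):
    """Linear probe that tries to reconstruct forgotten targets from latents."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.probe = nn.Linear(dim, dim)

    def forward(self, z):
        return self.probe(z)

def minmax_step(enc, adv, opt_enc, opt_adv,
                x_forget, y_forget, x_retain, y_retain, lam: float = 0.5):
    # Inner maximization: the adversary learns to recover the forgotten target.
    z = enc(x_forget).detach()  # freeze the encoder during the adversary update
    adv_loss = F.mse_loss(adv(z), y_forget)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # Outer minimization: keep retained behavior while making recovery hard
    # (the minus sign turns the adversary's loss into a reward for forgetting).
    enc_loss = (F.mse_loss(enc(x_retain), y_retain)
                - lam * F.mse_loss(adv(enc(x_forget)), y_forget))
    opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()
```

A curriculum of the kind the abstract describes could then anneal the gate's sharpness or the adversarial weight `lam` over training, so the model faces progressively stronger internal recovery attempts.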