$\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the privacy risks posed by large language models unintentionally memorizing sensitive training data, a challenge compounded by the failure of existing unlearning methods to ensure complete removal of such information while preserving model utility. To this end, the authors propose AGT$^{AO}$, a unified framework that formulates unlearning as a minimax game in latent space. The approach employs adversarial gating training to resist internal reconstruction attempts and introduces an adaptive orthogonality mechanism to dynamically mitigate gradient conflicts between forgetting and retention objectives. A curriculum-based gating strategy further enhances robustness. Experimental results demonstrate that AGT$^{AO}$ achieves highly effective unlearning (KUR ≈ 0.01) while maintaining strong model performance (MMLU score of 58.30), significantly outperforming current state-of-the-art methods.
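The summary does not reproduce the paper's algorithm, but the core idea behind orthogonality-based conflict mitigation can be sketched in a few lines: when the forgetting gradient and the retention gradient point in opposing directions, project the conflicting component away so that the forgetting update does not undo retention. The sketch below shows only this generic projection step (in the spirit of gradient-surgery methods such as PCGrad); the *adaptive* weighting of AGT$^{AO}$ is not reproduced, and `orthogonalize` is a hypothetical helper name, not the authors' API.

```python
def orthogonalize(g_forget, g_retain):
    """Remove the conflicting component of the forgetting gradient.

    If the two gradients conflict (negative dot product), project
    g_forget onto the subspace orthogonal to g_retain; otherwise
    return it unchanged. Gradients are plain lists of floats here.
    """
    dot = sum(f * r for f, r in zip(g_forget, g_retain))
    if dot < 0.0:  # geometric conflict: descending on forget ascends on retain
        norm_sq = sum(r * r for r in g_retain)
        scale = dot / norm_sq
        g_forget = [f - scale * r for f, r in zip(g_forget, g_retain)]
    return g_forget
```

After projection, the forgetting update has zero component along the retain gradient, so (to first order) it no longer degrades retained knowledge.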

📝 Abstract
While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.
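The abstract describes the AGT component as a latent-space min-max game with a curriculum-strengthened gate. As a rough, self-contained illustration of that training pattern (not the authors' implementation — the gate here is reduced to a logistic probe on scalar latents, and all names and constants are invented for the toy), one can alternate a "max" step, where the gate learns to detect latents that still encode forgotten content, with a "min" step, where those latents are pushed to evade the gate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy latents: "forget" examples carry a detectable feature (+2.0),
# "retain" examples sit at the origin.
forget = [2.0] * 5
retain = [0.0] * 5

w, b = 0.0, 0.0      # gate: a logistic probe playing the internal adversary
lr_model = 0.1

for rnd in range(200):
    lr_gate = 0.01 * (1.0 + rnd / 50.0)  # curriculum: the gate strengthens over rounds
    # max step: the gate learns to flag latents that still encode forgotten content
    for z, y in [(z, 1.0) for z in forget] + [(z, 0.0) for z in retain]:
        p = sigmoid(w * z + b)
        w += lr_gate * (y - p) * z
        b += lr_gate * (y - p)
    # min step: push forget latents to evade the gate
    # (descent on -log(1 - p); its gradient w.r.t. z is p * w)
    forget = [z - lr_model * sigmoid(w * z + b) * w for z in forget]
```

Over the rounds, the forget latents are driven toward the retain region until the gate can no longer separate them, while retain latents are never touched; the curriculum makes the adversary progressively harder to fool, which is the mechanism the abstract credits for resisting internal recovery attempts.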
Problem

Research questions and friction points this paper is trying to address.

machine unlearning
catastrophic forgetting
privacy risk
adversarial recovery
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine Unlearning
Adaptive Orthogonality
Adversarial Gating Training
Large Language Models
Catastrophic Forgetting
Pengyu Li
School of Computer Science and Technology, Xi’an Jiaotong University, China
Lingling Zhang
Assistant Professor, Xi'an Jiaotong University
Computer vision · Few-shot learning · Zero-shot learning
Zhitao Gao
School of Computer Science and Technology, Xi’an Jiaotong University, China
Yanrui Wu
School of Computer Science and Technology, Xi’an Jiaotong University, China
Yuxuan Dong
School of Computer Science and Technology, Xi’an Jiaotong University, China
Huan Liu
School of Computer Science and Technology, Xi’an Jiaotong University, China
Bifan Wei
School of Computer Science and Technology, Xi’an Jiaotong University, China
Jun Liu
School of Computer Science and Technology, Xi’an Jiaotong University, China