🤖 AI Summary
This work addresses a key vulnerability of existing machine unlearning methods: supposedly forgotten content is often inadvertently recovered under benign fine-tuning, compromising unlearning efficacy. Through systematic analysis, we identify syntactic similarity, not topical relevance, as the primary driver of such unintended relearning, revealing for the first time that syntactic structure is a hidden mechanism behind unlearning failure. Building on this insight, we propose a novel paradigm that suppresses relearning through syntactic diversification. Our approach integrates syntactically diverse rewriting, alignment analysis of representations and gradients, and unlearning optimization grounded in structural heterogeneity. Experiments demonstrate that our method significantly mitigates benign relearning, accelerates the unlearning process, and alleviates the trade-off between unlearning effectiveness and model utility.
📝 Abstract
Machine unlearning aims to remove specific content from trained models while preserving overall performance. However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, shows that existing unlearning methods remain fundamentally fragile. A common explanation attributes this effect to topical relevance, but we find that account insufficient. Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, because their representations and gradients align with those of the forgotten content. Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning. This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.
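To make the core idea concrete, here is a minimal sketch of syntactic diversification as described above: candidate paraphrases of a forget query are filtered so that only structurally distinct rewrites are kept. The token-overlap metric, the `diversify` helper, the `max_sim` threshold, and the example queries are all illustrative assumptions; the paper's actual analysis compares representations and gradients, not surface tokens.

```python
import difflib

def syntactic_similarity(a: str, b: str) -> float:
    # Crude proxy for structural similarity: token-sequence overlap ratio.
    # A real system would compare parse trees or POS patterns instead.
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def diversify(query: str, candidates: list[str], max_sim: float = 0.5) -> list[str]:
    # Keep only rewrites whose surface structure diverges enough from the
    # original query and from every rewrite already kept.
    kept: list[str] = []
    for cand in candidates:
        if syntactic_similarity(query, cand) <= max_sim and all(
            syntactic_similarity(cand, k) <= max_sim for k in kept
        ):
            kept.append(cand)
    return kept

query = "Who wrote the novel that inspired the film?"
candidates = [
    "Who wrote the novel that inspired this film?",       # near-identical structure
    "The film was inspired by a novel; name its author.",  # restructured
    "Name the author of the film's source novel.",         # restructured
]
diverse = diversify(query, candidates)  # the near-identical rewrite is dropped
```

In practice the candidate paraphrases would come from a rewriting model, and the retained heterogeneous rewrites would replace the original forget queries during unlearning optimization; the filter above only illustrates the selection criterion.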