Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

📅 2024-10-16

🏛️ arXiv.org

📈 Citations: 13

✨ Influential: 1

career value

179K/year

🤖 AI Summary

To address the vulnerability of diffusion models to malicious fine-tuning—where harmful or copyrighted content is relearned after concept forgetting—this paper proposes a meta-forgetting framework. During the forgetting phase, it introduces lightweight meta-objectives optimized via gradient-aware meta-optimization, semantic correlation modeling, and adversarial fine-tuning detection to actively weaken residual semantic couplings between benign and forgotten concepts, thereby blocking relearning pathways. This work establishes the first “anti-backtracking” forgetting paradigm, requiring no architectural modifications or alterations to the original training pipeline and remaining compatible with mainstream forgetting algorithms. Evaluated on Stable Diffusion (v1-4 and SDXL), the method reduces average relearning rates by 68.3% while incurring minimal FID degradation (<2.1), preserving generation quality. Ablation studies confirm the effectiveness of each component.

Technology Category

Application Category

📝 Abstract

With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g.,"skin") retained in DMs are related to the unlearned ones (e.g.,"nudity"), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies. Our code is available at https://github.com/sail-sg/Meta-Unlearning.

Problem

Research questions and friction points this paper is trying to address.

Preventing relearning of unlearned concepts in diffusion models

Addressing malicious finetuning compromising unlearning efforts

Ensuring self-destruction of related benign concepts during misuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-unlearning prevents relearning via self-destruct mechanism

Compatible with existing unlearning methods via meta objective

Validated on Stable Diffusion models with ablation studies

🔎 Similar Papers

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient