CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To precisely suppress unsafe, copyrighted, or privacy-invasive concepts in pre-trained diffusion models, this paper proposes a training-free concept unlearning framework. The method identifies the token-embedding subspace associated with the target concept directly in weight space, analyzes its spectral structure via Singular Value Decomposition (SVD), and applies a "Spectral Eraser": a closed-form orthogonal projection that selectively suppresses the harmful concept. The entire process requires no fine-tuning, supervision, or iterative optimization, and completes an edit in under two seconds. Experiments on artistic-style, object, identity, and explicit-content removal demonstrate substantial improvements over baselines: high generation fidelity, minimal degradation of unrelated capabilities, and strong robustness against red-teaming jailbreak attacks. The core contribution is the first integration of spectral analysis with closed-form weight-space editing, enabling fast, interpretable, and high-precision targeted forgetting.
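The mechanics of an SVD-based orthogonal eraser can be sketched in a few lines. This is a minimal illustration of the general idea described above, not the paper's actual implementation: the function name, the choice of projecting out the retain span first, and the rank parameter `k` are all assumptions for the sketch.

```python
import numpy as np

def spectral_eraser(W, E_forget, E_retain, k=2):
    """Hypothetical sketch: erase the top-k directions specific to the
    forget-concept embeddings while leaving the retain span untouched.

    W        : (d, d) weight matrix to edit (e.g. a text-projection layer)
    E_forget : (n_f, d) token embeddings of the concept to erase
    E_retain : (n_r, d) token embeddings of concepts to preserve
    """
    d = W.shape[1]
    # Project the forget embeddings onto the orthogonal complement of the
    # retain span, isolating features unique to the undesired concept.
    _, _, Vr = np.linalg.svd(E_retain, full_matrices=False)
    E_res = E_forget @ (np.eye(d) - Vr.T @ Vr)
    # The top-k right singular vectors of the residual define the
    # discriminative subspace to remove.
    _, _, Vf = np.linalg.svd(E_res, full_matrices=False)
    V_k = Vf[:k]                                # (k, d), orthonormal rows
    P_erase = np.eye(d) - V_k.T @ V_k           # orthogonal projector
    # Closed-form, single-step weight edit: inputs along the erased
    # directions are mapped to zero; retain-span inputs pass unchanged.
    return W @ P_erase
```

Because the erased directions are constructed inside the orthogonal complement of the retain span, the edited matrix acts identically to the original on retained embeddings: the "forgetting" is confined to the discriminative subspace.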

📝 Abstract
As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only $2$ seconds. Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming.
Problem

Research questions and friction points this paper is trying to address.

Preventing unsafe or copyrighted content in diffusion models
Improving efficiency and specificity of concept removal
Balancing toxicity filtering with unrelated concept preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free concept unlearning via weight space editing
Spectral Eraser using SVD for discriminative subspaces
Closed-form Expansion Mechanism for spectral regularization
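The abstract describes the Expansion Mechanism as modulating singular vectors by their relative significance to control forgetting strength. One plausible closed-form reading is a "soft" projector whose per-direction suppression depends on the singular values; the shrinkage formula `sigma**2 / (sigma**2 + lam)` and the parameter `lam` below are assumptions of this sketch, not the paper's stated formulation.

```python
import numpy as np

def soft_spectral_eraser(W, V, sigma, lam=1.0):
    """Hypothetical sketch of significance-weighted erasure: dominant
    forget-directions (large sigma_i) are suppressed almost fully,
    while weak, possibly shared directions are mostly preserved.

    W     : (d, d) weight matrix to edit
    V     : (k, d) right singular vectors of the forget subspace
    sigma : (k,)   corresponding singular values
    lam   : assumed regularization strength (lam -> 0 recovers the
            hard orthogonal projection; large lam leaves W unchanged)
    """
    alpha = sigma**2 / (sigma**2 + lam)   # per-direction suppression in (0, 1)
    P_soft = np.eye(W.shape[1]) - V.T @ np.diag(alpha) @ V
    return W @ P_soft
```

This interpolation between "erase fully" and "keep" is one way a single scalar can trade off toxicity filtering against preservation of unrelated concepts, while keeping the whole update closed-form.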