When Are Concepts Erased From Diffusion Models?

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks how to evaluate rigorously whether concept erasure in diffusion models is complete: whether a target concept is truly removed rather than merely suppressed. The authors propose two conceptual models of how erasure may work: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. They then introduce a suite of independent evaluations that combines adversarial attacks, novel probing techniques, and analysis of what the model generates in place of the erased concept. The results highlight a tension between minimizing side effects and maintaining robustness to adversarial prompts, and the work underlines the importance of comprehensive evaluation for erasure in diffusion models.

📝 Abstract

Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.
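The two conceptual models named in the abstract can be made concrete with a toy sketch. The code below is not the paper's method. It assumes that "internal guidance" can be illustrated with the standard classifier-free guidance update, and that "reducing likelihood" can be illustrated by lowering one entry of a categorical distribution. All function names (`cfg_noise`, `guided_with_interference`, `suppress_likelihood`) are hypothetical and chosen only for illustration.

```python
import math


def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move the unconditional noise prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_with_interference(eps_uncond, eps_cond, scale, erased):
    """Toy 'guidance interference' (an assumption, not the paper's method):
    for an erased concept the conditional prediction is replaced by the
    unconditional one, so the guidance term vanishes."""
    if erased:
        eps_cond = eps_uncond
    return cfg_noise(eps_uncond, eps_cond, scale)


def suppress_likelihood(logits, target, penalty):
    """Toy 'likelihood reduction' (an assumption, not the paper's method):
    subtract a penalty from the target concept's logit, then apply softmax."""
    adjusted = list(logits)
    adjusted[target] -= penalty
    peak = max(adjusted)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in adjusted]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    uncond, cond = [0.0, 0.0], [1.0, 2.0]
    print(cfg_noise(uncond, cond, 7.5))
    print(guided_with_interference(uncond, cond, 7.5, erased=True))
    print(suppress_likelihood([0.0, 0.0, 0.0], target=0, penalty=5.0))
```

In this toy, likelihood suppression leaves the target with a small but nonzero probability, which is one way to picture why the abstract asks for independent evaluations such as adversarial attacks and probing: a suppressed concept may still be reachable.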

Problem

Research questions and friction points this paper is trying to address.

Assessing thoroughness of concept erasure in diffusion models

Evaluating mechanisms for reducing target concept generation

Balancing minimal side effects with adversarial prompt robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

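Conceptual model (i): erasure as reducing the likelihood of generating the target concept

Conceptual model (ii): erasure as interference with the model's internal guidance mechanisms

An evaluation suite combining adversarial attacks, novel probing techniques, and analysis of alternative generations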

🔎 Similar Papers

Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts