A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Text-to-image diffusion models pose safety risks because they can generate harmful content, and existing concept erasure methods struggle to balance erasure precision against image fidelity. This paper proposes Single Neuron-based Concept Erasure (SNCE), a single-neuron-level framework: it trains a sparse autoencoder (SAE) to map text embeddings into a sparse, semantically disentangled latent space, and it introduces a modulated frequency scoring mechanism to identify the neuron that encodes a specific harmful concept. Suppressing that neuron's activation erases the concept with "surgical" precision. Across multiple benchmarks, the authors report state-of-the-art harmful-concept removal and strong robustness to adversarial attacks, with minimal degradation of generation quality for non-target concepts. The work is presented as the first to realize fine-grained, interpretable, and controllable safety intervention without compromising high-fidelity image synthesis.


📝 Abstract

Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
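
The suppression step described in the abstract can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the weights, the dimensions (2-d embedding, 3 latent neurons), and the function names `encode`, `decode`, and `erase` are all hypothetical, and applying only the change in the reconstruction (rather than replacing the embedding with the full reconstruction) is an assumption made here so that SAE reconstruction error does not alter the embedding.

```python
# Toy sketch: suppress one latent neuron of a sparse autoencoder (SAE).
# All weights, sizes, and names are invented for illustration only.

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(x, w_enc, b_enc):
    """Map an embedding to non-negative latent activations (ReLU)."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def decode(z, w_dec):
    """Reconstruct an embedding from latent activations."""
    return matvec(w_dec, z)

def erase(x, w_enc, b_enc, w_dec, neuron):
    """Remove one neuron's contribution from x; leave everything else as is."""
    z = encode(x, w_enc, b_enc)
    z_suppressed = list(z)
    z_suppressed[neuron] = 0.0
    # Assumption: apply only the difference between the two reconstructions.
    delta = [a - b for a, b in zip(decode(z_suppressed, w_dec), decode(z, w_dec))]
    return [xi + di for xi, di in zip(x, delta)]

# Hypothetical SAE: neuron 2 plays the role of the "harmful concept" neuron.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

target_embedding = [1.0, 2.0]    # activates neuron 2
benign_embedding = [0.5, 0.25]   # leaves neuron 2 at zero

erased_target = erase(target_embedding, W_ENC, B_ENC, W_DEC, neuron=2)
erased_benign = erase(benign_embedding, W_ENC, B_ENC, W_DEC, neuron=2)
```

In this toy setup the benign embedding passes through unchanged because the chosen neuron never fires on it, which is the property the abstract refers to as minimal disruption for non-target concepts.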

Problem

Research questions and friction points this paper is trying to address.

Precisely removing harmful concepts from text-to-image models

Minimizing image quality degradation during concept erasure

Achieving surgical precision by manipulating single neurons

Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulates a single neuron for precise erasure

Uses a sparse autoencoder for disentangled semantic mapping

Employs modulated frequency scoring for neuron identification
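
The idea of locating a concept neuron from activation patterns can be sketched as follows. This summary does not give the formula for modulated frequency scoring, so the score below is a simple stand-in chosen for illustration: the fraction of target-concept prompts on which a neuron fires minus the fraction of neutral prompts on which it fires. The function names and the activation values are hypothetical.

```python
# Toy sketch: pick the latent neuron that fires most selectively on a concept.
# The scoring rule is a stand-in assumption, not the paper's exact method.

def activation_frequency(activations, threshold=0.0):
    """Fraction of prompts on which each neuron exceeds the threshold."""
    n_prompts = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(1 for row in activations if row[j] > threshold) / n_prompts
        for j in range(n_neurons)
    ]

def pick_concept_neuron(target_acts, neutral_acts):
    """Return the index of the highest-scoring neuron and all scores."""
    freq_target = activation_frequency(target_acts)
    freq_neutral = activation_frequency(neutral_acts)
    scores = [t - n for t, n in zip(freq_target, freq_neutral)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical SAE activations: one row per prompt, one column per neuron.
target_acts = [[0.9, 0.0, 0.3], [0.7, 0.2, 0.0], [0.8, 0.0, 0.4]]
neutral_acts = [[0.0, 0.5, 0.2], [0.0, 0.0, 0.6], [0.1, 0.3, 0.0]]

best_neuron, scores = pick_concept_neuron(target_acts, neutral_acts)
```

Here neuron 0 fires on every target prompt and on only one neutral prompt, so it receives the highest score; a neuron that fires equally often on both sets scores zero and would not be selected.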

🔎 Similar Papers

Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts