🤖 AI Summary
Text-to-image models can generate harmful content, and existing concept erasure methods struggle to remove a target concept precisely without degrading image quality. This paper proposes Single Neuron-based Concept Erasure (SNCE), which erases a concept by manipulating a single neuron. A sparse autoencoder (SAE) is trained to map text embeddings into a sparse, disentangled latent space in which individual neurons align with atomic semantic concepts. A modulated frequency scoring of activation patterns then identifies the neuron responsible for a given harmful concept, and suppressing that neuron's activation yields "surgical-grade" semantic erasure. Across multiple benchmarks, the method achieves state-of-the-art target-concept erasure and strong robustness against adversarial attacks, while minimally degrading generation quality for non-target concepts. To the best of the authors' knowledge, this is the first work to realize fine-grained, interpretable, and controllable safety intervention without compromising high-fidelity image synthesis.
📝 Abstract
Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
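The pipeline described in the abstract (encode a text embedding with an SAE, locate a concept-specific neuron, suppress it, decode) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the SAE weights are hand-set rather than trained, the embeddings are three-dimensional toy vectors, and `pick_neuron` uses a plain difference of activation frequencies as a stand-in, because the abstract does not define the modulated frequency score.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(w_enc: Matrix, b_enc: Sequence[float], x: Sequence[float]) -> Vector:
    """SAE encoder: ReLU(W_enc x + b_enc) gives non-negative latents."""
    return [max(0.0, a + b) for a, b in zip(matvec(w_enc, x), b_enc)]


def decode(w_dec: Matrix, z: Sequence[float]) -> Vector:
    """SAE decoder: linear map from latents back to embedding space."""
    return matvec(w_dec, z)


def activation_frequency(latents: Sequence[Sequence[float]], eps: float = 1e-6) -> Vector:
    """Fraction of samples on which each latent neuron fires."""
    n = len(latents)
    width = len(latents[0])
    return [sum(1 for z in latents if z[j] > eps) / n for j in range(width)]


def pick_neuron(target_latents, benign_latents) -> int:
    """Hypothetical stand-in score: firing frequency on target-concept
    prompts minus firing frequency on benign prompts. This is NOT the
    paper's modulated frequency score, which the abstract leaves undefined."""
    f_target = activation_frequency(target_latents)
    f_benign = activation_frequency(benign_latents)
    scores = [t - b for t, b in zip(f_target, f_benign)]
    return max(range(len(scores)), key=scores.__getitem__)


def erase(w_enc: Matrix, b_enc: Sequence[float], w_dec: Matrix,
          x: Sequence[float], neuron: int) -> Vector:
    """Encode the embedding, zero one latent neuron, then decode."""
    z = encode(w_enc, b_enc, x)
    z[neuron] = 0.0
    return decode(w_dec, z)


# Toy setup (assumed): identity weights, so latent j mirrors embedding dim j.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_BIAS = [0.0, 0.0, 0.0]

# Pretend dimension 2 carries the target concept.
target_embeddings = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.5]]
benign_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

target_latents = [encode(IDENTITY, ZERO_BIAS, x) for x in target_embeddings]
benign_latents = [encode(IDENTITY, ZERO_BIAS, x) for x in benign_embeddings]

neuron = pick_neuron(target_latents, benign_latents)
cleaned = erase(IDENTITY, ZERO_BIAS, IDENTITY, target_embeddings[0], neuron)
print(neuron, cleaned)
```

In this toy setting only the selected latent is zeroed, so the other embedding dimensions pass through unchanged, which mirrors the abstract's claim of minimal disruption to non-target content. A real SAE reconstructs its input only approximately, so an actual implementation would also need to account for reconstruction error.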