π€ AI Summary
Diffusion models generate high-fidelity images but often produce NSFW content and reinforce societal biases, hindering real-world deployment. To address this, we propose a safety-constrained framework operating directly in the text embedding spaceβwithout altering the original prompt. Our method introduces a novel self-discovered semantic direction vector mechanism that geometrically steers text embeddings toward predefined safe regions. By initializing direction vectors via LoRA and jointly fine-tuning for safety, we achieve low-intrusiveness. Evaluated across multiple benchmarks, our approach significantly reduces NSFW generation (average β62.3%) and bias metrics (e.g., Stereoset β41.7%), while preserving image fidelity (FID change < 0.8). It outperforms existing mainstream safety mitigation methods in both effectiveness and preservation of generation quality.
π Abstract
The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines. WARNING:This paper contains model-generated images that may be potentially offensive.