DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

📅 2024-08-18
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
This work addresses the challenge of conducting pure black-box red-teaming attacks against text-to-image (T2I) diffusion models when no access to the text encoder—or any internal model components—is available. Method: We propose a gradient-free, model-agnostic, end-to-end prompt attack framework that integrates continuous–discrete prompt reparameterization (C-PRV/D-PRV) with zeroth-order optimization (ZOO), enabling efficient query-driven optimization directly in the discrete prompt space. The method requires only API-level queries—no gradient information, architectural knowledge, or encoder access. Contribution/Results: Evaluated across diverse state-of-the-art safety mechanisms and mainstream online T2I services, our approach achieves an average 8.5% higher attack success rate than existing black-box baselines. It demonstrates strong generalizability and practicality, establishing itself as an effective, automated red-teaming tool for robustness assessment of deployed T2I systems.

📝 Abstract
Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO, which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.
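The abstract's use of Zeroth Order Optimization to obtain gradient approximations from queries alone can be illustrated on a harmless toy problem. The sketch below is not the authors' implementation: it assumes a standard two-point random-direction estimator, and the names `zo_gradient`, `zo_descent`, and `toy_loss`, as well as the step sizes, are hypothetical choices made for illustration. The black box here is a simple quadratic rather than any T2I system.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_dirs=20, rng=None):
    """Estimate the gradient of f at x using only function evaluations.

    Two-point random-direction estimator (an assumed, generic choice):
    average of ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u over Gaussian u.
    """
    rng = rng or random.Random(0)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional slope obtained from two queries to the black box.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_dirs
    return grad


def toy_loss(x):
    """Hypothetical black-box objective: squared distance to the all-ones point."""
    return sum((xi - 1.0) ** 2 for xi in x)


def zo_descent(f, x0, lr=0.05, steps=200, seed=0):
    """Gradient descent driven purely by zeroth-order gradient estimates."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    x_final = zo_descent(toy_loss, [0.0, 0.0, 0.0])
    print("final loss:", toy_loss(x_final))
```

Each estimate costs `2 * num_dirs` queries, so the query budget grows linearly with the number of sampled directions. That trade-off between estimate quality and query count is the main design knob in any query-only setting.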
Problem

Research questions and friction points this paper is trying to address.

Black-box attack on text-to-image models
Enhancing attack prompts via Zeroth Order Optimization
Improving red teaming success rate for T2I models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth Order Optimization
Black-box attack method
Discrete prompt enhancement
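The idea of optimizing in a discrete space through a continuous surrogate can be sketched on a toy problem. This page does not expand the acronyms C-PRV and D-PRV, so the code below makes its own assumption: a continuous vector of per-position probabilities from which a binary vector is sampled, with the probabilities updated by zeroth-order estimates. All names (`sample_discrete`, `expected_score`, `optimize_probs`, `TARGET`) and the toy score are hypothetical, and the sketch should not be read as the paper's procedure.

```python
import random

TARGET = [1, 0, 1, 1, 0]  # hypothetical hidden pattern, used only by the toy score


def toy_score(z):
    """Black-box score: fraction of positions where z matches TARGET."""
    return sum(1 for zi, ti in zip(z, TARGET) if zi == ti) / len(TARGET)


def clip01(v):
    """Keep a probability inside [0, 1]."""
    return max(0.0, min(1.0, v))


def sample_discrete(probs, rng):
    """Draw a binary vector; position i is 1 with probability probs[i]."""
    return [1 if rng.random() < p else 0 for p in probs]


def expected_score(score, probs, rng, num_samples=16):
    """Monte Carlo estimate of the mean score of vectors sampled from probs."""
    total = sum(score(sample_discrete(probs, rng)) for _ in range(num_samples))
    return total / num_samples


def optimize_probs(score, dim, steps=100, lr=0.1, mu=0.1, num_dirs=8, seed=0):
    """Ascend the expected score using query-only gradient estimates."""
    rng = random.Random(seed)
    probs = [0.5] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(num_dirs):
            u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            up = [clip01(p + mu * ui) for p, ui in zip(probs, u)]
            down = [clip01(p - mu * ui) for p, ui in zip(probs, u)]
            slope = (expected_score(score, up, rng)
                     - expected_score(score, down, rng)) / (2.0 * mu)
            for i in range(dim):
                grad[i] += slope * u[i] / num_dirs
        probs = [clip01(p + lr * g) for p, g in zip(probs, grad)]
    return probs


if __name__ == "__main__":
    learned = optimize_probs(toy_score, len(TARGET))
    print([round(p, 2) for p in learned])
```

The continuous vector is what makes a gradient estimate meaningful at all: the score is only defined on binary vectors, but its expectation under the sampling distribution varies smoothly with the probabilities.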
Pucheng Dang
University of Chinese Academy of Sciences
Privacy Protection, DNN security
Xing Hu
Institute of Computing Technology, Chinese Academy of Sciences; Shanghai Innovation Center for Processor Technologies, SHIC
Dong Li
Institute of Computing Technology, Chinese Academy of Sciences
Rui Zhang
Institute of Computing Technology, Chinese Academy of Sciences
Qi Guo
Institute of Computing Technology, Chinese Academy of Sciences
Kaidi Xu
Associate Professor, City University of Hong Kong
AI Security, Uncertainty Quantification, Formal Verification