🤖 AI Summary
Customizing diffusion models creates risks of privacy leakage and opinion manipulation. Existing adversarial anti-customization attacks treat text prompts and image inputs in isolation, overlooking how the two are coupled and how the UNet's attention mechanisms expose joint vulnerabilities. This paper proposes DADiff, a two-stage adversarial attack framework that (1) integrates prompt-level adversarial vectors, generated in the first stage, into the generation of image-level adversarial examples in the second; (2) disrupts pixel-to-pixel correlations in the UNet's self-attention modules and pixel-to-prompt correlations in its cross-attention modules; and (3) introduces a local random timestep gradient ensemble strategy. On mainstream face datasets, DADiff improves anti-customization performance by 10–30% over existing methods in cross-prompt, keyword-mismatch, cross-model, and cross-mechanism settings, strengthening protection against diffusion-based customization services.
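The attention-level objectives in item (2) can be illustrated on toy matrices. The sketch below is an illustration under stated assumptions, not DADiff's published losses: it computes scaled dot-product attention maps in pure Python, scores cross-attention alignment as the mean squared error between maps built from instance-prompt keys and adversarial-prompt keys, and scores self-attention disruption as the negative distance between perturbed and clean self-attention maps. The helper names (`attention_map`, `cross_attn_alignment_loss`, `self_attn_disruption_loss`) and both loss forms are hypothetical, and both prompts are assumed to have the same number of tokens.

```python
import math

# Minimal pure-Python sketch; matrices are lists of rows.

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    """Numerically stable row-wise softmax."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_map(q, k):
    """softmax(Q K^T / sqrt(d)): one row of attention weights per query (pixel)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    return softmax_rows([[v * scale for v in row] for row in scores])

def mse(a, b):
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def cross_attn_alignment_loss(pixel_q, inst_k, adv_k):
    # Hypothetical objective (minimized by the attacker): pixel queries should
    # attend to instance-prompt tokens the way they attend to the adversarial
    # prompt vectors. Assumes both prompts have the same token count.
    return mse(attention_map(pixel_q, inst_k), attention_map(pixel_q, adv_k))

def self_attn_disruption_loss(adv_q, adv_k, clean_q, clean_k):
    # Hypothetical objective (minimized by the attacker): the negative distance
    # between perturbed and clean pixel-to-pixel attention, so minimizing it
    # pushes pixel correlations away from those of the clean image.
    return -mse(attention_map(adv_q, adv_k), attention_map(clean_q, clean_k))

# Toy usage: 2 pixel queries, 2 prompt tokens, feature dimension 2.
pixel_q = [[1.0, 0.0], [0.0, 1.0]]
inst_k = [[1.0, 0.0], [0.0, 1.0]]
adv_k = [[0.0, 1.0], [1.0, 0.0]]
aligned = cross_attn_alignment_loss(pixel_q, inst_k, inst_k)    # identical keys
misaligned = cross_attn_alignment_loss(pixel_q, inst_k, adv_k)  # swapped keys
# Reuse the toy matrices as perturbed (adv_k) and clean (inst_k) pixel keys.
disrupt = self_attn_disruption_loss(pixel_q, adv_k, pixel_q, inst_k)
```

Identical keys give zero alignment loss, while mismatched keys give a positive value the attacker would drive down; in a real attack these maps would come from the UNet's attention layers rather than hand-written matrices.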
📝 Abstract
Fine-tuning techniques for text-to-image diffusion models facilitate image customization but risk privacy breaches and opinion manipulation. Current research focuses on prompt- or image-level adversarial attacks for anti-customization, yet it overlooks the correlation between these two levels and the relationship between the model's internal modules and its inputs. This limits anti-customization performance in practical threat scenarios. We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization, which, for the first time, integrates a prompt-level adversarial attack into the generation of image-level adversarial examples. In stage 1, we generate prompt-level adversarial vectors to guide the subsequent image-level attack. In stage 2, in addition to an end-to-end attack on the UNet model, we disrupt its self- and cross-attention modules: the self-attention attack breaks the correlations between image pixels, and the cross-attention attack aligns, within the images, the cross-attention results computed with the instance prompt with those computed with the adversarial prompt vectors. Furthermore, we introduce a local random timestep gradient ensemble strategy, which updates the adversarial perturbations by aggregating gradients from randomly sampled timesteps across multiple timestep segments. Experimental results on several mainstream facial datasets show 10%–30% improvements over existing methods in cross-prompt, keyword-mismatch, cross-model, and cross-mechanism anti-customization.
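The local random timestep gradient ensemble can be sketched as follows, under stated assumptions: the diffusion training range [0, T) is split into equal contiguous segments, one timestep is drawn uniformly from each segment per iteration, the per-timestep gradients are averaged, and the perturbation takes a signed (PGD-style) step projected onto an L-infinity ball. The functions `sample_local_timesteps`, `toy_grad`, `ensemble_gradient`, and `pgd_step` are hypothetical; `toy_grad` is a quadratic stand-in for the UNet-based attack loss, and the segment count, step size, budget, and ascent direction are illustrative choices, not values from the paper.

```python
import random

def sample_local_timesteps(num_train_timesteps, num_segments, rng):
    """Split [0, T) into contiguous segments and draw one timestep uniformly from each."""
    seg_len = num_train_timesteps // num_segments
    steps = []
    for i in range(num_segments):
        lo = i * seg_len
        # The last segment absorbs any remainder so all of [0, T) is covered.
        hi = num_train_timesteps if i == num_segments - 1 else lo + seg_len
        steps.append(rng.randrange(lo, hi))
    return steps

def toy_grad(x, t, target):
    # Hypothetical stand-in for the gradient of the UNet attack loss at timestep t:
    # the gradient of (1 + t/1000) * ||x - target||^2 with respect to x.
    w = 1.0 + t / 1000.0
    return [2.0 * w * (xi - ti) for xi, ti in zip(x, target)]

def ensemble_gradient(x, timesteps, target):
    """Average the per-timestep gradients (the ensemble)."""
    grads = [toy_grad(x, t, target) for t in timesteps]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_step(x_adv, x_clean, grad, alpha, eps):
    """Signed ascent step, then projection onto the L-inf ball and the [0, 1] pixel range."""
    out = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        v = xa + alpha * sign(g)
        v = min(max(v, xc - eps), xc + eps)
        out.append(min(max(v, 0.0), 1.0))
    return out

# Toy usage: a 3-"pixel" image, T = 1000, 4 segments, budget 8/255.
rng = random.Random(0)
x_clean = [0.5, 0.5, 0.5]
target = [0.2, 0.8, 0.5]  # hypothetical reference the toy loss is measured against
x_adv = list(x_clean)
for _ in range(10):
    ts = sample_local_timesteps(1000, 4, rng)  # fresh random timesteps every iteration
    g = ensemble_gradient(x_adv, ts, target)
    x_adv = pgd_step(x_adv, x_clean, g, alpha=0.005, eps=8 / 255)
```

Drawing fresh timesteps inside each segment on every iteration keeps every noise level represented while still randomizing which gradients are combined; the exact segmentation, weighting, and update rule in DADiff may differ.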