🤖 AI Summary
This work exposes a critical vulnerability of diffusion model watermarks under black-box conditions: a watermark can be forged or removed using only a single watermarked image, without access to model weights or training data. Methodologically, the authors identify and exploit the many-to-one mapping from generated images to initial noises: for each watermark there is a region of the clean-image latent space whose members all invert to the same watermarked noise. Building on this, they learn adversarial perturbations that move an image into that region (forgery) or out of it (removal), using a gradient-free optimization driven entirely by black-box queries. The attack achieves high success rates against four state-of-the-art watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) on Stable Diffusion v1.4 and v2.0. These results systematically reveal fundamental robustness deficiencies in existing diffusion-based watermarking methods, providing both a crucial security warning and a technical benchmark for designing the next generation of verifiable generative-content watermarks.
📝 Abstract
Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise, and the resulting pattern is widely considered hard to remove or to forge onto unrelated images. In this paper, we propose a black-box adversarial attack that assumes no access to the diffusion model weights. Our attack uses only a single watermarked example and rests on a simple observation: the mapping between images and initial noises is many-to-one, so for each watermark there is a region of the clean-image latent space whose members all map to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack that forges the watermark by learning perturbations that move an image into the watermarked region. We show that a similar approach applies to watermark removal, by learning perturbations that exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in these watermarking methods, motivating future research on improving them.
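The enter/exit intuition above can be sketched as a query-only optimization. The sketch below is an illustration, not the paper's implementation: `watermark_score` is a hypothetical stand-in for the real black-box detector (which would invert the image to its initial noise and run the scheme's key check), `target` is an assumed centre of the watermarked region, and the perturbation is learned with NES-style black-box gradient estimation using antithetic samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the black-box detector: in the real attack, the score
# would come from DDIM inversion followed by the scheme's key check.
# `target` marks the centre of a hypothetical "watermarked region".
DIM = 64
target = rng.normal(size=DIM)

def watermark_score(x):
    """Black-box score: higher means deeper inside the watermarked region."""
    return -np.linalg.norm(x - target)

def nes_perturb(x, direction=+1, iters=300, sigma=0.1, lr=0.05, pop=20):
    """Learn a perturbation via NES-style black-box gradient estimation.

    direction=+1 pushes x INTO the watermarked region (forgery);
    direction=-1 pushes a watermarked x OUT of it (removal).
    Antithetic pairs (x + sigma*e, x - sigma*e) cancel the score's offset."""
    x = x.copy()
    for _ in range(iters):
        eps = rng.normal(size=(pop, DIM))
        diffs = np.array([watermark_score(x + sigma * e)
                          - watermark_score(x - sigma * e) for e in eps])
        grad = (diffs[:, None] * eps).mean(axis=0) / (2 * sigma)
        x += direction * lr * grad  # ascend (or descend) the detector score
    return x

clean = rng.normal(size=DIM)
forged = nes_perturb(clean, direction=+1)    # clean latent now scores as watermarked
removed = nes_perturb(forged, direction=-1)  # perturb back out of the region
```

Only the sign of the update distinguishes forgery from removal, mirroring the paper's framing of the two attacks as entering and exiting the same region boundary.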