Safe-VAR: Safe Visual Autoregressive Model for Text-to-Image Generative Watermarking

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Visual autoregressive (VAR) models lack dedicated watermarking solutions, which makes it hard to achieve watermark invisibility, generation fidelity, and robustness at the same time. Method: The paper introduces what it describes as the first end-to-end trainable watermarking framework tailored for text-to-image autoregressive generation. An adaptive-scale interaction module dynamically controls the timing and intensity of watermark injection, and a cross-scale fusion mechanism jointly models multi-resolution features and watermark patterns through hybrid attention heads and Mixture-of-Experts (MoE) specialists. Contribution/Results: Compared with existing diffusion-based watermarking methods, the framework is reported to achieve better generation fidelity (lower FID), more accurate watermark recovery (lower Bit Error Rate, BER), and greater robustness against cropping, compression, and noise. It also generalizes to out-of-domain watermarks such as QR codes, which the paper presents as easing the trade-off among invisibility, fidelity, and robustness in VAR models.
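For readers unfamiliar with the BER metric mentioned in the summary, the following is a minimal sketch of how a bit error rate is commonly computed. It assumes the watermark can be represented as a sequence of 0/1 bits; the function name `bit_error_rate` and the sample data are illustrative and are not taken from the paper.

```python
def bit_error_rate(embedded, extracted):
    """Return the fraction of bit positions where the two watermarks differ."""
    if len(embedded) != len(extracted):
        raise ValueError("watermarks must have the same length")
    if not embedded:
        raise ValueError("watermarks must be non-empty")
    # Count positions where the extracted bit does not match the embedded bit.
    errors = sum(a != b for a, b in zip(embedded, extracted))
    return errors / len(embedded)


# Illustrative data (not from the paper): 2 of the 8 bits are flipped.
embedded = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 0, 1, 0, 1, 1, 0]
ber = bit_error_rate(embedded, extracted)
print(ber)  # 0.25
```

A BER of 0 means the watermark was recovered exactly, so a lower BER after cropping, compression, or noise indicates a more robust watermark rather than a less visible one.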

📝 Abstract
With the success of autoregressive learning in large language models, autoregressive modeling has become a dominant approach for text-to-image generation, offering high efficiency and visual quality. However, invisible watermarking for visual autoregressive (VAR) models remains underexplored, despite its importance in misuse prevention. Existing watermarking methods, designed for diffusion models, often struggle to adapt to the sequential nature of VAR models. To bridge this gap, we propose Safe-VAR, the first watermarking framework specifically designed for autoregressive text-to-image generation. Our study reveals that the timing of watermark injection significantly impacts generation quality, and watermarks of different complexities exhibit varying optimal injection times. Motivated by this observation, we propose an Adaptive Scale Interaction Module, which dynamically determines the optimal watermark embedding strategy based on the watermark information and the visual characteristics of the generated image. This ensures watermark robustness while minimizing its impact on image quality. Furthermore, we introduce a Cross-Scale Fusion mechanism, which integrates a mixture of heads and a mixture of experts to effectively fuse multi-resolution features and handle complex interactions between image content and watermark patterns. Experimental results demonstrate that Safe-VAR achieves state-of-the-art performance, significantly surpassing existing counterparts regarding image quality, watermarking fidelity, and robustness against perturbations. Moreover, our method exhibits strong generalization to an out-of-domain watermark dataset of QR codes.
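The idea of choosing how strongly to inject a watermark at each scale can be illustrated with a small sketch. This is not the paper's Adaptive Scale Interaction Module: it assumes, purely for illustration, that each scale has a relevance score (learned in a real system, hard-coded here), turns those scores into weights with a softmax, and adds the weighted watermark features to the image features. All names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_across_scales(image_feats, watermark_feats, scale_scores):
    """Blend watermark features into image features with per-scale weights.

    image_feats, watermark_feats: one list of floats per scale.
    scale_scores: one hypothetical relevance score per scale.
    """
    weights = softmax(scale_scores)
    fused = []
    for img, wm, w in zip(image_feats, watermark_feats, weights):
        # A larger weight means a stronger watermark contribution at this scale.
        fused.append([x + w * y for x, y in zip(img, wm)])
    return weights, fused


# Two scales with equal scores, so each scale gets half of the injection budget.
weights, fused = inject_across_scales(
    image_feats=[[0.0, 0.0], [0.0, 0.0]],
    watermark_feats=[[1.0, 1.0], [1.0, 1.0]],
    scale_scores=[0.0, 0.0],
)
print(weights)  # [0.5, 0.5]
print(fused)    # [[0.5, 0.5], [0.5, 0.5]]
```

In the abstract's framing, the scores would depend on both the watermark and the generated image, so a simple logo and a dense QR code could end up concentrated at different scales.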
Problem

Research questions and friction points this paper is trying to address.

Develops Safe-VAR for autoregressive text-to-image watermarking.
Addresses optimal timing for watermark injection in VAR models.
Enhances watermark robustness and image quality simultaneously.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Scale Interaction Module for watermarking
Cross-Scale Fusion mechanism for feature integration
Dynamic watermark embedding strategy optimization
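As a rough illustration of the mixture-of-experts part of feature fusion, the sketch below shows generic top-k expert gating: keep the k experts with the highest gate scores, renormalize their scores with a softmax, and sum their weighted outputs. This is a textbook pattern under simplifying assumptions, not the paper's Cross-Scale Fusion design; it omits the mixture of heads, and the toy experts and gate scores are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest gate logits and renormalize them with a softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def mixture_of_experts(x, experts, gate_logits, k=2):
    """Return the gate-weighted sum of the outputs of the k selected experts."""
    gates = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out


# Toy experts standing in for learned sub-networks (hypothetical).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
# The third expert has a low gate score and is dropped when k=2.
out = mixture_of_experts([1.0, 2.0], experts, gate_logits=[0.0, 0.0, -5.0], k=2)
print(out)  # [2.0, 3.5]
```

In a trained model the gate scores would be produced by a small network from the input features, so different image and watermark combinations would be routed to different experts.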
Ziyi Wang
Zhejiang University
Songbai Tan
Shenzhen University
Gang Xu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Xuerui Qiu
Institute of Automation, Chinese Academy of Sciences
Representation Learning, 3D Computer Vision, Model Compression
Hongbin Xu
South China University of Technology
Xin Meng
University of Pittsburgh
AI and medical imaging
Ming Li
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Fei Richard Yu
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)