🤖 AI Summary
To address the lack of intrinsic, lossless, and spatiotemporally fine-grained watermarking mechanisms for AI-generated videos, this paper proposes VideoShield, a novel end-to-end intrinsic watermarking framework. VideoShield embeds watermarks directly during diffusion-based video generation, avoiding post-processing artifacts and preserving output fidelity. It introduces a noise-domain watermark mapping and template-bit encoding mechanism grounded in DDIM inversion, coupled with spatiotemporal consistency verification, enabling universal, zero-shot watermark embedding with no training or fine-tuning overhead. The framework is compatible with both text-to-video (T2V) and image-to-video (I2V) models and generalizes to image generation. Experiments demonstrate lossless video fidelity (PSNR and SSIM match the unwatermarked output), robust watermark extraction, and 96.2% accuracy in localized tampering detection.
📝 Abstract
Artificial Intelligence Generated Content (AIGC) has advanced significantly, particularly with the development of video generation models such as text-to-video (T2V) and image-to-video (I2V) models. However, like other AIGC types, video generation requires robust content control. A common approach is to embed watermarks, but most research has focused on images, with limited attention given to videos. Traditional methods, which embed watermarks frame-by-frame in a post-processing manner, often degrade video quality. In this paper, we propose VideoShield, a novel watermarking framework specifically designed for popular diffusion-based video generation models. Unlike post-processing methods, VideoShield embeds watermarks directly during video generation, eliminating the need for additional training. To ensure video integrity, we introduce a tamper localization feature that can detect changes both temporally (across frames) and spatially (within individual frames). Our method maps watermark bits to template bits, which are then used to generate watermarked noise during the denoising process. Using DDIM Inversion, we can reverse the video to its original watermarked noise, enabling straightforward watermark extraction. Additionally, template bits allow precise detection of potential temporal and spatial modifications. Extensive experiments across various video models (both T2V and I2V) demonstrate that our method effectively extracts watermarks and detects tampering without compromising video quality. Furthermore, we show that this approach is applicable to image generation models, enabling tamper detection in generated images as well. Codes and models are available at [https://github.com/hurunyi/VideoShield](https://github.com/hurunyi/VideoShield).
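The abstract's pipeline (watermark bits → template bits → watermarked noise → extraction via DDIM inversion) can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function names, the repetition factor `reps`, and the specific encoding (each template bit fixing the sign of one Gaussian sample, with majority-vote decoding) are hypothetical stand-ins for the actual noise-domain mapping.

```python
import numpy as np

def embed_watermark_noise(watermark_bits, reps, rng=None):
    """Hypothetical sketch: repeat each watermark bit `reps` times to form
    template bits, then let each template bit fix the sign of one Gaussian
    sample, yielding noise that still looks approximately N(0, 1)."""
    rng = np.random.default_rng(rng)
    template = np.repeat(watermark_bits, reps)        # template bits
    magnitudes = np.abs(rng.standard_normal(template.size))
    signs = np.where(template == 1, 1.0, -1.0)
    return signs * magnitudes                         # "watermarked noise"

def extract_watermark(inverted_noise, n_bits, reps):
    """Decode bits by majority vote over each bit's group of template
    positions; tolerant to some sign flips from imperfect inversion."""
    votes = (inverted_noise.reshape(n_bits, reps) > 0).sum(axis=1)
    return (votes > reps / 2).astype(int)

bits = np.array([1, 0, 1, 1], dtype=int)
noise = embed_watermark_noise(bits, reps=8, rng=0)
recovered = extract_watermark(noise, n_bits=4, reps=8)
```

In the real system the noise would seed the diffusion model's denoising process, and `inverted_noise` would come from DDIM inversion of the generated video; groups whose votes disagree locally would flag spatial or temporal tampering.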