🤖 AI Summary
This work addresses the growing threat of highly realistic AI-generated video disinformation by proposing a reconstruction-guided detection framework that targets the limited generalization of existing methods. The approach introduces a pretrained wavelet-based variational autoencoder (WF-VAE) into video forgery detection for the first time, using frame-level reconstruction errors to capture spatial artifacts and aligning these errors with multi-frame semantic features. To model how both the reconstruction errors and the semantic content evolve across frames, the framework further incorporates a Mamba module, which enables video-level discrimination. Experiments show that the proposed method achieves strong detection performance and generalizes well across diverse generative models and challenging cross-domain scenarios.
📝 Abstract
AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet it remains challenging because a detector must capture spatial artifacts, model temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal the distributional discrepancies between the two. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level detection of AI-generated videos. ReConFuse extracts reconstruction error cues from WF-VAE-reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model their temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.
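
To make the notion of a frame-wise reconstruction error concrete, the following is a minimal sketch and not the paper's implementation. It assumes a video stored as a NumPy array of shape (T, H, W, C) and uses a hypothetical `toy_reconstruct` function (average pooling followed by nearest-neighbour upsampling) as a stand-in for the pretrained WF-VAE. The function names and the choice of mean absolute error are illustrative assumptions.

```python
import numpy as np


def framewise_reconstruction_error(video, reconstruct):
    """Return one mean-absolute-error value per frame.

    video: float array of shape (T, H, W, C).
    reconstruct: callable mapping a video array to an array of the same shape.
    """
    recon = reconstruct(video)
    if recon.shape != video.shape:
        raise ValueError("reconstruction must match the input shape")
    # Average the absolute residual over height, width, and channels only,
    # so the time axis is preserved and each frame gets its own score.
    return np.abs(video - recon).mean(axis=(1, 2, 3))


def toy_reconstruct(video, factor=2):
    """Hypothetical stand-in for an autoencoder (NOT WF-VAE).

    Average-pools each frame by `factor`, then upsamples by repetition.
    H and W are assumed to be divisible by `factor`.
    """
    t, h, w, c = video.shape
    pooled = video.reshape(t, h // factor, factor, w // factor, factor, c)
    pooled = pooled.mean(axis=(2, 4))
    return pooled.repeat(factor, axis=1).repeat(factor, axis=2)


rng = np.random.default_rng(0)
video = rng.random((8, 16, 16, 3))  # 8 random frames as placeholder data
errors = framewise_reconstruction_error(video, toy_reconstruct)
print(errors.shape)  # (8,)
```

The resulting length-T sequence is the kind of temporally ordered signal that, per the abstract, a sequence model would consume together with semantic features. In this sketch the sequence is simply returned as is.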