ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the growing threat of highly realistic AI-generated video disinformation by proposing a reconstruction-guided detection framework that overcomes the limited generalization of existing methods. The approach introduces a pretrained wavelet-based variational autoencoder (WF-VAE) into video forgery detection for the first time, leveraging frame-level reconstruction errors to capture spatial artifacts and aligning them with multi-frame semantic features. To model the temporal dynamics of both reconstruction errors and semantic content across frames, the framework further incorporates a Mamba module, enabling effective video-level discrimination. Extensive experiments demonstrate that the proposed method achieves superior detection performance and strong generalization across diverse generative models and challenging cross-domain scenarios.
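The frame-level reconstruction error described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `box_blur_reconstruct` is a hypothetical stand-in for the encode-decode pass of a pretrained WF-VAE (here just a 3x3 mean filter), and the choice of mean absolute error per frame is an assumption, since the summary does not state the error metric.

```python
import numpy as np


def box_blur_reconstruct(video: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pretrained VAE's encode-decode pass.

    Applies a 3x3 spatial mean filter to every frame so that the
    "reconstruction" loses fine detail, as a lossy autoencoder would.
    video has shape (T, H, W, C).
    """
    height, width = video.shape[1], video.shape[2]
    # Pad only the spatial axes; edge padding avoids dark borders.
    padded = np.pad(video, ((0, 0), (1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(video, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + height, dx:dx + width, :]
    return out / 9.0


def framewise_reconstruction_error(video, reconstruct):
    """Return per-pixel error maps (T, H, W, C) and one scalar per frame (T,).

    Assumption: the error is the absolute difference between the input
    and its reconstruction, averaged over space and channels per frame.
    """
    video = np.asarray(video, dtype=np.float64)
    recon = reconstruct(video)
    error_maps = np.abs(video - recon)
    per_frame_error = error_maps.mean(axis=(1, 2, 3))
    return error_maps, per_frame_error


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((8, 16, 16, 3))  # 8 synthetic frames, values in [0, 1)
    _, errors = framewise_reconstruction_error(clip, box_blur_reconstruct)
    print(errors.shape)  # one error value per frame
```

In a real pipeline, `reconstruct` would wrap the pretrained autoencoder, and the per-frame error maps (rather than the scalars) would feed the downstream detector.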

📝 Abstract

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet it remains challenging because a detector must capture spatial artifacts, model temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal the distributional discrepancies between the two. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level detection of AI-generated content. ReConFuse extracts reconstruction error cues from WF-VAE-reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model their temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.
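To make the data flow of the fusion and temporal stages concrete, here is a toy sketch under loud assumptions. The abstract does not specify how error cues are aligned with semantic features, so sigmoid gating is assumed here. A plain exponential-moving-average recurrence stands in for the Mamba module, which is a learned selective state-space model and considerably more expressive. The names `fuse_and_score`, `w_gate`, `w_out`, and `decay` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_and_score(error_feats, semantic_feats, w_gate, w_out, decay=0.9):
    """Toy video-level score from per-frame features.

    error_feats, semantic_feats: arrays of shape (T, D), one row per frame.
    w_gate: (D, D) projection of error features into a gate (assumed).
    w_out:  (D,) linear read-out to a single logit (assumed).
    """
    # Assumed fusion: error features gate the semantic features per frame.
    gate = sigmoid(error_feats @ w_gate)           # (T, D), values in (0, 1)
    fused = gate * semantic_feats                  # (T, D)

    # Stand-in for the temporal module: a fixed linear recurrence over frames.
    state = np.zeros(fused.shape[1])
    for frame_feat in fused:
        state = decay * state + (1.0 - decay) * frame_feat

    # Video-level probability that the clip is AI-generated.
    return float(sigmoid(state @ w_out))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_frames, dim = 8, 4
    score = fuse_and_score(
        rng.random((num_frames, dim)),
        rng.standard_normal((num_frames, dim)),
        rng.standard_normal((dim, dim)),
        rng.standard_normal(dim),
    )
    print(score)
```

The point of the sketch is the ordering of operations: per-frame fusion first, then a recurrence over the frame axis, then a single video-level decision. In a trained system every weight here would be learned end to end.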

Problem

Research questions and friction points this paper is trying to address.

AI-generated video detection

multimedia forensics

reconstruction error

temporal dynamics

content authenticity

Innovation

Methods, ideas, or system contributions that make the work stand out.

reconstruction error

semantic fusion

AI-generated video detection

WF-VAE

Mamba-based temporal modeling

🔎 Similar Papers

Deep Common Feature Mining for Efficient Video Semantic Segmentation

2024-03-05 · IEEE Transactions on Circuits and Systems for Video Technology (Print) · Citations: 0

What Matters in Detecting AI-Generated Videos like Sora?

2024-06-27 · arXiv.org · Citations: 12