Advance Fake Video Detection via Vision Transformers

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the escalating information security risks posed by the proliferation of AI-generated deepfake videos, this paper proposes the first temporal fusion detection framework based on Vision Transformers (ViT). Methodologically, it introduces a ViT-embedding-driven cross-frame temporal aggregation mechanism, integrated with contrastive learning and multi-source generative video representation modeling, enabling few-shot fine-tuning. Key contributions include: (1) constructing the first large-scale, diverse benchmark dataset of synthetic videos covering five mainstream open-source generative models; and (2) achieving high-accuracy, robust generalization in detecting both open-source and closed-source generated videos. Experiments demonstrate 96.2% accuracy on the proposed benchmark, a 41% reduction in cross-generator generalization error, and 92.7% accuracy using only five labeled samples for fine-tuning.

Technology Category

Application Category

📝 Abstract
Recent advancements in AI-based multimedia generation have enabled the creation of hyper-realistic images and videos, raising concerns about their potential use in spreading misinformation. The widespread accessibility of generative techniques, which allow for the production of fake multimedia from prompts or existing media, along with their continuous refinement, underscores the urgent need for highly accurate and generalizable AI-generated media detection methods, underlined also by new regulations like the European Digital AI Act. In this paper, we draw inspiration from Vision Transformer (ViT)-based fake image detection and extend this idea to video. We propose an {original} %innovative framework that effectively integrates ViT embeddings over time to enhance detection performance. Our method shows promising accuracy, generalization, and few-shot learning capabilities across a new, large and diverse dataset of videos generated using five open source generative techniques from the state-of-the-art, as well as a separate dataset containing videos produced by proprietary generative methods.
Problem

Research questions and friction points this paper is trying to address.

Detect hyper-realistic fake videos using Vision Transformers
Address misinformation risks from AI-generated multimedia
Improve accuracy and generalization in fake video detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision Transformers for fake video detection
Integrates ViT embeddings over time
Tests on diverse datasets including proprietary methods
🔎 Similar Papers
No similar papers found.