🤖 AI Summary
Deepfake video detection models suffer from limited generalization across diverse generation methods and content domains, hindering real-world deployment. To address this, the authors propose GenConViT, a generative vision architecture that combines ConvNeXt and Swin Transformer backbones with an Autoencoder and a Variational Autoencoder (VAE), capturing both local visual artifacts and global latent distributions for multi-granularity anomaly perception. The model is trained and evaluated on five benchmarks: DFDC, FF++, TM, DeepfakeTIMIT, and Celeb-DF (v2). Experiments show high accuracy across the tested datasets, though the authors note that generalization to out-of-distribution data still requires further work. The source code is publicly available.
📝 Abstract
Deepfakes have raised significant concerns due to their potential to spread false information and compromise digital media integrity. Current deepfake detection models often struggle to generalize across a diverse range of deepfake generation techniques and video content. In this work, we propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction, and it utilizes an Autoencoder and a Variational Autoencoder to learn from the latent data distribution. By learning from both the visual artifacts and the latent data distribution, GenConViT achieves improved performance in detecting a wide range of deepfake videos. The model is trained and evaluated on the DFDC, FF++, TM, DeepfakeTIMIT, and Celeb-DF (v2) datasets. The proposed GenConViT model demonstrates strong performance in deepfake video detection, achieving high accuracy across the tested datasets. While our model shows promising results by leveraging visual and latent features, we demonstrate that further work is needed to improve its generalizability, i.e., its performance on out-of-distribution data. Our model provides an effective solution for identifying a wide range of fake videos while preserving media integrity. The open-source code for GenConViT is available at https://github.com/erprogs/GenConViT.
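To make the two-branch design concrete, here is a minimal NumPy sketch of the idea described above: an Autoencoder branch that scores visual artifacts via reconstruction error, and a VAE branch that models the latent distribution and contributes a sampled latent code to classification. This is a toy illustration with random stand-in weights, not the paper's implementation; the real GenConViT uses pretrained ConvNeXt and Swin Transformer encoders, and all dimensions and layer shapes below are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 64, 32, 16  # toy input, hidden, and latent dimensions (assumptions)

# Random stand-in weights; in the actual model these would be learned
# parameters sitting on top of ConvNeXt/Swin features.
W_enc = rng.normal(size=(D, H)) * 0.1
W_dec = rng.normal(size=(H, D)) * 0.1
W_mu  = rng.normal(size=(H, L)) * 0.1
W_lv  = rng.normal(size=(H, L)) * 0.1
W_cls = rng.normal(size=(H + L, 2)) * 0.1  # real vs. fake logits

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_branch_score(x):
    """Toy sketch of AE + VAE scoring for one flattened frame vector."""
    h = relu(x @ W_enc)

    # AE branch: reconstruction error serves as a visual-artifact signal.
    x_rec = h @ W_dec
    rec_err = float(np.mean((x - x_rec) ** 2))

    # VAE branch: reparameterization trick, z = mu + sigma * eps.
    mu, log_var = h @ W_mu, h @ W_lv
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=L)

    # Fuse hidden features with the latent sample for classification.
    probs = softmax(np.concatenate([h, z]) @ W_cls)
    return probs, rec_err

probs, rec_err = two_branch_score(rng.normal(size=D))
```

The key point the sketch shows is the fusion: artifact-level evidence (reconstruction error) and distribution-level evidence (the VAE's latent code) feed one decision, which is what gives the model its multi-granularity view of a frame.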