🤖 AI Summary
This work proposes a unified deepfake detection framework that addresses the growing threat deepfakes pose to digital trust. It integrates spatial-frequency cross-attention with physiological signals, such as blood pulsation, for high-precision identification of manipulated content in both images and videos. The approach uses either a Swin Transformer or EfficientNet-B4 for visual feature extraction and employs BERT for multimodal fusion. Evaluated on the FaceForensics++ (FF++) and Celeb-DF benchmarks, the Swin+BERT variant achieves state-of-the-art performance with AUC scores of 99.80% and 99.88%, respectively, significantly outperforming existing methods while demonstrating strong cross-dataset generalization.
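The core fusion step can be pictured as bidirectional cross-attention between the spatial and frequency feature streams. Below is a minimal PyTorch sketch of that idea; the module name `SpatialFrequencyCrossAttention`, the token shapes, the dimensions, and the mean-pool fusion are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: cross-attention between spatial and frequency tokens.
# Names, dimensions, and fusion strategy are assumptions for illustration.
import torch
import torch.nn as nn


class SpatialFrequencyCrossAttention(nn.Module):
    """Lets spatial tokens attend to frequency tokens, and vice versa."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One attention block per direction of the cross-attention.
        self.spat_to_freq = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.freq_to_spat = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, spatial: torch.Tensor, freq: torch.Tensor) -> torch.Tensor:
        # spatial, freq: (batch, tokens, dim) sequences from the two branches.
        s, _ = self.spat_to_freq(query=spatial, key=freq, value=freq)
        f, _ = self.freq_to_spat(query=freq, key=spatial, value=spatial)
        # Mean-pool each attended stream and concatenate into one vector.
        return torch.cat([s.mean(dim=1), f.mean(dim=1)], dim=-1)


# Usage: fuse backbone tokens with frequency tokens for real/fake logits.
xattn = SpatialFrequencyCrossAttention(dim=256, num_heads=4)
head = nn.Linear(512, 2)  # binary real-vs-fake classifier head
spatial_tokens = torch.randn(8, 49, 256)  # e.g. from Swin / EfficientNet-B4
freq_tokens = torch.randn(8, 49, 256)     # e.g. from a frequency branch
logits = head(xattn(spatial_tokens, freq_tokens))
print(logits.shape)  # torch.Size([8, 2])
```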
📝 Abstract
Advancements in the field of AI are increasingly giving rise to new threats, one of the most prominent being the synthesis and misuse of deepfakes. To sustain trust in the digital age, detecting and tagging deepfakes is essential. In this paper, a novel architecture for deepfake detection in images and videos is presented. The architecture uses cross-attention between spatial- and frequency-domain features, along with a blood-pulsation detection module, to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than the state of the art: 99.80% and 99.88% AUC on FF++ and Celeb-DF with Swin Transformer and BERT, and 99.55% and 99.38% with EfficientNet-B4 and BERT. The approach also generalizes well, achieving strong cross-dataset results.
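For concreteness, one common way to obtain frequency-domain features like those referenced above is a 2D FFT of the face crop followed by a small encoder. The sketch below assumes a log-amplitude spectrum and a hypothetical `freq_encoder`; the paper's actual frequency representation is not specified in this abstract.

```python
# Hypothetical frequency branch: log-amplitude 2D FFT plus a small CNN.
# This is one plausible realization, not the paper's confirmed design.
import torch
import torch.nn as nn


def frequency_features(images: torch.Tensor) -> torch.Tensor:
    # images: (batch, 3, H, W) in [0, 1].
    spectrum = torch.fft.fft2(images, norm="ortho")     # complex spectrum per channel
    amplitude = torch.log1p(spectrum.abs())             # log-amplitude, a common choice
    return torch.fft.fftshift(amplitude, dim=(-2, -1))  # center the low frequencies


# A small CNN turns the spectrum into tokens for the cross-attention stage.
freq_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
faces = torch.rand(8, 3, 224, 224)
tokens = freq_encoder(frequency_features(faces)).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([8, 3136, 256])
```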