🤖 AI Summary
To address the weak generalization and limited robustness of deepfake face detection in unconstrained, real-world scenarios, this paper proposes an attention-driven supervised contrastive learning framework. The method integrates three heterogeneous backbone architectures: MaxViT (combining convolution with strided attention), CoAtNet (a convolution-attention hybrid), and EVA-02 (pretrained via masked image modeling). Together they capture local forensic details, multi-scale structural artifacts, and global semantic inconsistencies induced by forgery. After contrastive fine-tuning, the backbones are frozen and independent classification heads are trained on top of them; a majority-voting ensemble then combines the three predictions to enhance stability. Evaluated on the DFWild-Cup validation set, the framework achieves 95.83% accuracy, demonstrating substantial improvements in cross-dataset generalization and robustness under realistic, in-the-wild conditions.
📝 Abstract
This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), which focuses on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned with a supervised contrastive loss to enhance feature separation. These models were chosen for their complementary strengths: the integration of convolutional layers and strided attention makes MaxViT well suited to detecting local features; the hybrid use of convolution and attention mechanisms in CoAtNet effectively captures multi-scale features; and EVA-02's robust pretraining with masked image modeling excels at capturing global features. After contrastive training, we freeze the parameters of these models and train only the classification heads. Finally, a majority-voting ensemble combines the predictions from the three models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves an accuracy of 95.83% on the validation dataset.
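The final ensemble step can be sketched as follows. This is a minimal illustration, not the authors' code: the per-model prediction arrays are hypothetical, and with three models voting on a binary real/fake label a strict majority always exists, so no tie-breaking is needed.

```python
import numpy as np

def majority_vote(per_model_preds: np.ndarray) -> np.ndarray:
    """Combine binary predictions by majority vote.

    per_model_preds: array of shape (n_models, n_samples) holding
    hard labels (0 = real, 1 = fake), one row per classifier head.
    Returns one fused label per sample.
    """
    votes = per_model_preds.sum(axis=0)                      # fake-votes per sample
    return (votes > per_model_preds.shape[0] / 2).astype(int)

# Hypothetical per-image labels from the three frozen-backbone heads.
preds = np.array([
    [1, 0, 1, 1],  # e.g. MaxViT head
    [1, 0, 0, 1],  # e.g. CoAtNet head
    [0, 0, 1, 1],  # e.g. EVA-02 head
])
print(majority_vote(preds))  # -> [1 0 1 1]
```

Because the three backbones make partly uncorrelated errors (local vs. multi-scale vs. global features), the fused label can be correct even when one model misclassifies, as in the first and third samples above.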