🤖 AI Summary
Existing deepfake detection models struggle to simultaneously capture local details (via CNNs) and model global semantics (via Transformers), suffer from attention instability in hybrid architectures, and generalize poorly under limited training data. To address these issues, this paper proposes a lightweight distilled Transformer architecture. Its core innovation is a local-enhanced global representation distillation paradigm: leveraging hierarchical attention and local window-based feature enhancement, the method compresses the model while preserving discriminative forgery cues within a knowledge distillation framework. Evaluated on FaceForensics++ and Celeb-DF, it achieves state-of-the-art performance, with a maximum AUC of 99.2%. It reduces parameter count by 38% and accelerates inference by 2.1× compared to baseline models, demonstrating superior efficiency and generalization, especially in low-data regimes.
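To make the distillation idea concrete, below is a minimal NumPy sketch of a generic knowledge-distillation objective of the kind such a framework would optimize: a hard-label cross-entropy term, a temperature-scaled soft-label KL term (Hinton-style), and a feature-alignment term standing in for the local window-based feature enhancement. The hyperparameters (`T`, `alpha`, `beta`) and the feature tensors are illustrative assumptions, not the paper's reported method or settings.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      s_feat, t_feat, T=4.0, alpha=0.5, beta=0.1):
    """Generic KD objective: hard-label CE + soft-label KL + feature MSE.
    Weights alpha/beta and temperature T are illustrative, not the
    paper's hyperparameters."""
    # Cross-entropy on ground-truth labels (real vs. fake).
    p_s = softmax(student_logits)
    ce = -np.mean(np.log(p_s[np.arange(len(labels)), labels] + 1e-12))
    # KL(teacher || student) on temperature-softened logits,
    # scaled by T^2 to keep gradient magnitudes comparable.
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    kl = T**2 * np.mean(np.sum(
        q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12)), axis=-1))
    # Feature alignment on (hypothetical) local-window feature maps.
    feat = np.mean((s_feat - t_feat) ** 2)
    return alpha * ce + (1 - alpha) * kl + beta * feat
```

When the student matches the teacher exactly, the KL and feature terms vanish and only the cross-entropy remains, so the loss grows as the student's predictions drift from both the labels and the teacher's soft targets.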