🤖 AI Summary
This work addresses a challenge in deepfake detection: holistic facial representations dilute critical forgery artifacts. To mitigate this, the authors propose a segmentation-guided spatial indexing mechanism that uses a frozen FaRL semantic parser to localize forensically relevant regions, such as the mouth, and extracts the corresponding patch tokens from a pretrained DINOv3 ViT-L/16 model for classification with a linear probe. The approach achieves strong generalization and interpretable, region-specific attribution without fine-tuning the backbone or requiring target-domain data. On Celeb-DF v2, the mouth-indexed model attains an AUC of 0.905, outperforming LipForensics by 8.1 percentage points and Xception by 16.9 percentage points. Ablation studies confirm that both DINOv3's representations and the spatial indexing strategy are necessary.
📝 Abstract
We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, so the attribution does not depend on an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both the DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
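The select-then-pool order described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch tokens and the parsing map are synthetic stand-ins for DINOv3 and FaRL outputs, and the class id `MOUTH_LABEL`, the majority-vote assignment of labels to patches, the mean pooling, and the helper names `patch_labels`, `region_pool`, and `probe_score` are all assumptions made for illustration.

```python
import numpy as np

MOUTH_LABEL = 1  # hypothetical class id; a real parser defines its own label set


def patch_labels(parse_map, patch=16):
    """Assign one semantic label per patch cell by majority vote (assumed rule)."""
    h, w = parse_map.shape
    gh, gw = h // patch, w // patch
    # Split the pixel-level label map into (gh * gw) cells of patch*patch pixels.
    cells = parse_map[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    cells = cells.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)
    n_classes = int(parse_map.max()) + 1
    counts = np.stack([(cells == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)  # shape (gh * gw,), row-major patch order


def region_pool(tokens, labels, target):
    """Keep only tokens whose patch label equals `target`, then mean-pool them."""
    mask = labels == target
    if not mask.any():
        return None  # region not visible in this frame
    return tokens[mask].mean(axis=0)


def probe_score(feature, weight, bias):
    """Linear probe followed by a sigmoid, giving a fake-probability score."""
    logit = float(feature @ weight + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Synthetic stand-ins: a 64x64 parsing map and a 4x4 grid of 8-dim patch tokens.
rng = np.random.default_rng(0)
parse_map = np.zeros((64, 64), dtype=np.int64)
parse_map[32:64, 16:48] = MOUTH_LABEL          # lower-centre block marked as mouth
tokens = rng.normal(size=(16, 8))              # one token per 16x16 patch
weight, bias = rng.normal(size=8), 0.0         # untrained probe parameters

labels = patch_labels(parse_map, patch=16)
mouth_feature = region_pool(tokens, labels, MOUTH_LABEL)
score = probe_score(mouth_feature, weight, bias)
```

Because the probe only ever receives the pooled mouth tokens, the region attribution is fixed by the indexing step itself. The CLS-token ablation in the abstract corresponds to skipping `region_pool` and feeding the probe a single global token instead.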