Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

📅 2026-05-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a weakness of holistic facial representations in deepfake detection: pooling over the whole face dilutes localized forgery artifacts. The authors propose a segmentation-guided spatial indexing mechanism in which a frozen FaRL semantic parser localizes forensically relevant regions, such as the mouth, and only the corresponding patch tokens from a pretrained DINOv3 ViT-L/16 model are pooled and classified by a linear probe. The approach achieves strong generalization and structural, region-specific attribution without fine-tuning the backbone or using target-domain data. On Celeb-DF v2, the mouth-indexed model attains an AUC of 0.905, outperforming LipForensics by 8.1 percentage points (pp) and Xception by 16.9 pp. Ablations show that both the DINOv3 representation and the spatial indexing are necessary: replacing regional selection with DINOv3's CLS token lowers AUC by 26.4 pp, and replacing DINOv3 with FaRL features lowers it by 20.9 pp.
📝 Abstract
We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision rests on mouth tokens alone rather than on an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both the DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
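The select-then-pool order described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random arrays stand in for DINOv3 patch tokens and for a FaRL parsing mask, and the function names, the majority-vote rule for mapping pixel labels to patches, the mean pooling, the 224x224 input size, the 1024-dimensional tokens, and the label id `MOUTH = 1` are all assumptions made for illustration.

```python
import numpy as np

PATCH = 16   # patch size of a ViT-L/16 backbone, in pixels
MOUTH = 1    # hypothetical label id for the mouth class

def patch_labels_from_mask(mask: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Give each patch the majority pixel label (assumed rule, row-major patch order)."""
    h, w = mask.shape
    gh, gw = h // patch, w // patch
    # Split the mask into (gh * gw) blocks of patch * patch pixels each.
    blocks = (
        mask[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gh * gw, patch * patch)
    )
    n_classes = int(mask.max()) + 1
    # Count pixels of every class inside every block, then take the winner.
    counts = np.stack([(blocks == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def region_pool(tokens: np.ndarray, labels: np.ndarray, target: int):
    """Discard non-target tokens, then mean-pool what remains (None if region is absent)."""
    keep = labels == target
    if not keep.any():
        return None
    return tokens[keep].mean(axis=0)

def linear_probe_score(feature: np.ndarray, weight: np.ndarray, bias: float) -> float:
    """Logistic linear probe: probability that the region is manipulated."""
    logit = float(feature @ weight + bias)
    return float(1.0 / (1.0 + np.exp(-logit)))

# Stand-in data: a 224x224 parsing mask with a rectangular "mouth" and random tokens.
rng = np.random.default_rng(0)
H = W = 224
mask = np.zeros((H, W), dtype=np.int64)
mask[144:176, 80:144] = MOUTH                              # covers 2 x 4 whole patches
tokens = rng.normal(size=((H // PATCH) * (W // PATCH), 1024))

labels = patch_labels_from_mask(mask)                      # one label per patch token
mouth_feature = region_pool(tokens, labels, MOUTH)         # index first, pool second
weight, bias = rng.normal(size=1024) * 0.01, 0.0           # untrained placeholder probe
score = linear_probe_score(mouth_feature, weight, bias)
```

Because `region_pool` drops every non-mouth token before pooling, the score depends only on the retained mouth tokens, which is the sense in which the attribution is structural rather than a saliency map computed afterward. In the CLS-token ablation described above, a single global token would be used in place of the `region_pool` output.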
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
generalizability
explainability
spatial indexing
semantic segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

segmentation-guided spatial indexing
deepfake detection
DINOv3
semantic token selection
explainable AI