Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a problem in deepfake detection: holistic facial representations dilute critical forgery artifacts. The authors propose a segmentation-guided spatial indexing mechanism that uses a frozen FaRL face parser to localize forensically relevant regions, such as the mouth, and extracts the corresponding patch tokens from a pretrained DINOv3 ViT-L/16 model for classification with a linear probe. The approach generalizes well and gives region-specific attribution by construction, without fine-tuning the backbone or using target-domain data. On Celeb-DF v2, the mouth-indexed model attains an AUC of 0.905, outperforming LipForensics by 8.1 pp and Xception by 16.9 pp. Ablations show that both the DINOv3 representation and the spatial index are necessary: replacing regional selection with the CLS token lowers Celeb-DF v2 AUC by 26.4 pp, and replacing DINOv3 with FaRL features lowers it by 20.9 pp.

📝 Abstract

We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
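
The select-then-pool order described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that patch tokens arrive in row-major grid order with any CLS or register tokens already removed, that each patch takes the majority parsing label of its pixels, and that retained tokens are mean-pooled. The names `MOUTH`, `patch_labels_from_parsing`, and `region_pooled_feature` are hypothetical, and random arrays stand in for real DINOv3 tokens and FaRL parsing output.

```python
import numpy as np

PATCH = 16   # patch size of a ViT-L/16 backbone
MOUTH = 1    # hypothetical label id; real parser label maps differ


def patch_labels_from_parsing(parsing, patch=PATCH):
    """Reduce an (H, W) integer parsing map to one label per patch.

    Assumption: each patch takes the majority label of its pixels.
    """
    h, w = parsing.shape
    gh, gw = h // patch, w // patch
    blocks = parsing[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)
    return np.array([np.bincount(b).argmax() for b in blocks])


def region_pooled_feature(tokens, labels, target):
    """Keep only tokens whose patch label equals `target`, then mean-pool."""
    mask = labels == target
    if not mask.any():
        # No patch carries the target label; return a zero feature.
        return np.zeros(tokens.shape[1], dtype=tokens.dtype)
    return tokens[mask].mean(axis=0)


# Toy example: a 64x64 image gives a 4x4 grid of 16 patches.
parsing = np.zeros((64, 64), dtype=np.int64)
parsing[32:48, 16:48] = MOUTH              # mouth covers patches (2, 1) and (2, 2)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))          # stand-in for (num_patches, dim) patch tokens

labels = patch_labels_from_parsing(parsing)
feature = region_pooled_feature(tokens, labels, MOUTH)

# Linear probe with placeholder weights; in practice these are learned.
w, b = rng.normal(size=8), 0.0
logit = float(feature @ w + b)
```

Because the probe only ever sees `feature`, a prediction from the mouth model depends on mouth tokens alone, which is the sense in which the abstract calls attribution structural.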

Problem

Research questions and friction points this paper is trying to address.

deepfake detection

generalizability

explainability

spatial indexing

semantic segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

segmentation-guided spatial indexing

deepfake detection

DINOv3