Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing large vision-language model (LVLM)-based deepfake detection methods, which rely on costly fine-tuning and exhibit poor generalization across domains and to novel forgery types. To overcome these challenges, we propose Semantic-Consistent Evidence Packets (SCEP), a training-free LVLM inference framework that identifies high-confidence suspicious image patches and integrates multi-dimensional cues—semantic, frequency-domain, and noise-based—into a compact, non-redundant evidence set. Using the CLS token from the frozen visual encoder as a global reference, SCEP guides the LVLM to produce interpretable forgery judgments without any parameter updates. Our approach achieves state-of-the-art performance across multiple benchmarks, demonstrating for the first time that high-accuracy, cross-domain deepfake detection is feasible without fine-tuning.

📝 Abstract
Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
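The evidence-mining pipeline in the abstract can be sketched in code. The following is a minimal, hypothetical illustration, not the paper's implementation: the function name, the equal-weight score fusion, the spherical k-means clustering, the mean-filter noise residual, and all parameter defaults are assumptions made for the sketch, since this page does not specify them.

```python
import numpy as np

def select_evidence_patches(patch_feats, cls_feat, patches,
                            n_clusters=4, per_cluster=2, grid=4, seed=0):
    """Hypothetical SCEP-style evidence mining sketch.

    patch_feats: (N, D) patch token features from a frozen vision encoder
    cls_feat:    (D,)   CLS token used as the global semantic reference
    patches:     (N, P, P) grayscale pixel patches in row-major grid order
    Returns indices of selected, non-redundant evidence patches.
    """
    rng = np.random.default_rng(seed)

    # 1) CLS-guided semantic mismatch: low cosine similarity to the
    #    global CLS reference flags a semantically inconsistent patch.
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    g = cls_feat / np.linalg.norm(cls_feat)
    mismatch = 1.0 - f @ g

    # 2) Frequency anomaly: share of spectral energy outside a central
    #    low-frequency window of the patch's 2-D FFT.
    def high_freq_ratio(p):
        spec = np.abs(np.fft.fftshift(np.fft.fft2(p))) ** 2
        c, r = p.shape[0] // 2, p.shape[0] // 4
        low = spec[c - r:c + r, c - r:c + r].sum()
        return 1.0 - low / (spec.sum() + 1e-8)
    freq = np.array([high_freq_ratio(p) for p in patches])

    # 3) Noise anomaly: residual energy after a crude local-mean denoiser.
    def noise_energy(p):
        blur = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2]
                + p[1:-1, 2:] + p[1:-1, 1:-1]) / 5.0
        return np.mean((p[1:-1, 1:-1] - blur) ** 2)
    noise = np.array([noise_energy(p) for p in patches])

    # Fused suspicion score (equal weights assumed here).
    def z(x):
        return (x - x.mean()) / (x.std() + 1e-8)
    score = z(mismatch) + z(freq) + z(noise)

    # 4) Spherical k-means over patch features so dispersed traces
    #    in different image regions are all represented.
    centers = f[rng.choice(len(f), n_clusters, replace=False)]
    for _ in range(10):
        labels = np.argmax(f @ centers.T, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                c = f[labels == k].mean(axis=0)
                centers[k] = c / np.linalg.norm(c)

    # 5) Top-scoring patches per cluster, then grid-based NMS:
    #    keep at most one patch per coarse grid cell.
    side = int(round(np.sqrt(len(patches))))
    taken_cells, evidence = set(), []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        for i in idx[np.argsort(-score[idx])][:per_cluster]:
            cell = (i // side * grid // side, i % side * grid // side)
            if cell not in taken_cells:
                taken_cells.add(cell)
                evidence.append(int(i))
    return evidence
```

The returned indices would then be mapped back to image regions and packed into the prompt that conditions the frozen LVLM; how SCEP serializes the evidence for the model is not described on this page.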
Problem

Research questions and friction points this paper is trying to address.

Image Deepfake Detection
Cross-Domain
Large Vision-Language Models
Generalization
Manipulation Artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evidence Packing
Training-Free LVLM
Semantic Consistency
Patch Token Mining
Cross-Domain Deepfake Detection
Yuxin Liu
Anhui University
Fei Wang
Hefei University of Technology
Kun Li
United Arab Emirates University
Yiqi Nie
Anhui University
Junjie Chen
Hefei University of Technology
Zhangling Duan
IAI, Hefei Comprehensive National Science Center
Zhaohong Jia
Anhui University