Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Convolutional neural network (CNN) detectors for face-swap video forensics exhibit poor cross-dataset generalization, particularly in detecting occlusion-related visual artifacts under real-world physical conditions. Method: We propose an occlusion-sensitive artifact analysis framework and systematically evaluate the cross-algorithm and cross-domain generalization of mainstream CNN architectures—including ResNet and EfficientNet—on a newly constructed dataset alongside multiple public benchmarks. Contribution/Results: Experiments show that while models achieve >95% accuracy on in-distribution test sets, performance drops by over 30% under cross-dataset evaluation, exposing a fundamental limitation in modeling physically grounded occlusion artifacts. This degradation reveals that generic, data-driven CNNs fail to capture the intrinsic geometric and photometric constraints governing real-world occlusions. We thus argue for designing dedicated detection strategies explicitly aligned with realistic occlusion mechanisms—rather than relying solely on scalable but brittle end-to-end learning paradigms. Our findings provide critical empirical evidence and methodological guidance for enhancing the robustness and deployability of face-swap detectors in practical forensic scenarios.

Technology Category

Application Category

📝 Abstract
Face swapping manipulations in video streams represents an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm excellent performance of general-purpose CNN architectures when operating within the same data source, but a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.
Problem

Research questions and friction points this paper is trying to address.

Detect visual artifacts in face-swapped videos
Evaluate CNN models on swapped face datasets
Assess generalization across sources and algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-based models detect face swapping artifacts
Benchmarking on diverse data corpora
Analyzing generalization across acquisition sources
🔎 Similar Papers
No similar papers found.