Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deepfake detection methods struggle to identify fine-grained, localized manipulations—such as unilateral eyebrow raising or subtle eye shape adjustments—in synthetic videos. Method: This paper proposes the first facial Action Unit (AU)-guided spatiotemporal representation framework specifically designed for detecting such local tampering. It introduces an AU-guided cross-modal cross-attention fusion mechanism, integrated with AU pretraining, random masking-based self-supervised learning, and explicit spatiotemporal feature modeling, yielding a lightweight and generalizable detector. Contribution/Results: Trained solely on the FF++ dataset, our method achieves a 20% accuracy gain on novel locally manipulated videos—significantly outperforming current state-of-the-art approaches—while maintaining competitive performance across mainstream benchmarks (e.g., FaceForensics++, Celeb-DF, DFDC). This demonstrates strong robustness and cross-manipulation generalization. Our work establishes a new paradigm for deepfake detection, shifting focus from global consistency verification to modeling local physiological plausibility grounded in facial action units.
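The core of the fusion mechanism described above is cross-attention: video tokens act as queries over AU-derived context tokens so the fused embedding emphasizes regions where facial action units are active. The sketch below shows plain scaled dot-product cross-attention in NumPy; the token shapes, function name, and the single-head, unprojected formulation are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(queries, context):
    """Scaled dot-product cross-attention: queries attend over context tokens."""
    d_k = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d_k)   # (n_q, n_c) similarity matrix
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ context, weights             # fused: (n_q, d)

# Hypothetical shapes: 8 video tokens and 12 AU tokens, 16-dim embeddings.
rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((8, 16))
au_tokens = rng.standard_normal((12, 16))
fused, attn = cross_attention_fuse(video_tokens, au_tokens)
```

In a full model, learned query/key/value projections and multiple heads would wrap this operation; the point here is only how AU context is mixed into the video representation via attention weights.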

📝 Abstract
With rapid advancements in generative modeling, deepfake techniques are increasingly narrowing the gap between real and synthetic videos, raising serious privacy and security concerns. Beyond traditional face swapping and reenactment, an emerging trend in recent state-of-the-art deepfake generation methods involves localized edits: subtle manipulations of specific facial features such as raising eyebrows, altering eye shapes, or modifying mouth expressions. These fine-grained manipulations pose a significant challenge for existing detection models, which struggle to capture such localized variations. To the best of our knowledge, this work presents the first detection approach explicitly designed to generalize to localized edits in deepfake videos by leveraging spatiotemporal representations guided by facial action units. Our method leverages a cross-attention-based fusion of representations learned from pretext tasks such as random masking and action unit detection to create an embedding that effectively encodes subtle, localized changes. Comprehensive evaluations across multiple deepfake generation methods demonstrate that our approach, despite being trained solely on the traditional FF++ dataset, sets a new benchmark in detecting recent deepfake-generated videos with fine-grained local edits, achieving a 20% improvement in accuracy over current state-of-the-art detection methods. Additionally, our method delivers competitive performance on standard datasets, highlighting its robustness and generalization across diverse types of local and global forgeries.
Problem

Research questions and friction points this paper is trying to address.

Detecting localized deepfake facial manipulations
Improving accuracy for fine-grained deepfake edits
Generalizing detection across local and global forgeries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses facial action units for deepfake detection
Combines random masking and AU detection tasks
Improves accuracy by 20% over existing methods
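The random-masking pretext task listed above follows the usual masked self-supervised recipe: hide a large fraction of patch tokens and train the model to reconstruct them. A minimal sketch of the masking step is below; the mask ratio, token shapes, and function name are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def random_token_mask(tokens, mask_ratio=0.75, rng=None):
    """Drop a random subset of tokens; a model would later reconstruct the masked ones."""
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])  # indices of visible tokens
    mask = np.ones(n, dtype=bool)                    # True = masked (to reconstruct)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

# Hypothetical: 16 spatiotemporal patch tokens with 32-dim embeddings.
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((16, 32))
visible, mask = random_token_mask(patch_tokens, mask_ratio=0.75, rng=rng)
```

Only the visible tokens would be fed to the encoder; the boolean mask records which positions the reconstruction loss is computed on.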