G2V2former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

📅 2024-08-14
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-frame face anti-spoofing (FAS) methods neglect temporal dynamic cues, leading to erroneous classifications in scenarios where photometric features are ambiguous but motion patterns are discriminative. To address this, we propose a video-based collaborative modeling framework that jointly analyzes facial images and keypoint graph structures to capture photometric and dynamic anomalies simultaneously. Our method introduces a novel Kronecker temporal attention mechanism that expands the temporal receptive field, and a graph-guided spatiotemporal fusion paradigm that leverages low-level keypoint motion to steer high-level expression dynamics modeling. Integrating graph neural networks, video Vision Transformers (ViTs), and spatiotemporally decoupled attention, the framework enables multi-scale feature aggregation. Evaluated on nine mainstream benchmarks, our approach achieves state-of-the-art performance across all datasets, particularly improving detection accuracy in motion-dominant scenarios and demonstrating strong generalization capability.

📝 Abstract
In videos containing spoofed faces, we may uncover the spoofing evidence based on either photometric or dynamic abnormality, or even a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to reach incorrect judgments, especially in cases that are easily distinguishable in terms of dynamics but challenging to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, motivated by the observation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The codes will be released soon.
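The factorization described above (spatial attention within each frame, temporal attention across frames) can be sketched generically. The snippet below is a minimal single-head, projection-free illustration of divided space-time attention; it is not the paper's implementation, and the exact structure of the Kronecker temporal attention (which widens the temporal receptive field) is not specified in this summary, so the temporal step here is plain frame-to-frame attention. All function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention, batched over all leading axes
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def divided_space_time_attention(x):
    """x: (T, N, D) = frames, patch tokens per frame, channels.

    Step 1 (spatial): tokens within each frame attend to each other.
    Step 2 (temporal): each patch position attends to the same
    position across frames. A hedged sketch only; the paper's
    Kronecker temporal attention enlarges this temporal step's
    receptive field.
    """
    x = attention(x, x, x)          # spatial: (T, N, N) @ (T, N, D)
    xt = np.swapaxes(x, 0, 1)       # (N, T, D): group by patch position
    xt = attention(xt, xt, xt)      # temporal: (N, T, T) @ (N, T, D)
    return np.swapaxes(xt, 0, 1)    # back to (T, N, D)

clip = np.random.default_rng(0).normal(size=(8, 49, 64))  # 8 frames, 7x7 patches
out = divided_space_time_attention(clip)
print(out.shape)  # (8, 49, 64)
```

Factorizing attention this way reduces cost from O((T·N)²) for full spatiotemporal attention to O(T·N² + N·T²), which is what makes video-length ViT inputs tractable.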
Problem

Research questions and friction points this paper is trying to address.

Detects spoofed faces in videos using dynamic and photometric features
Improves face anti-spoofing by integrating temporal and spatial attention
Utilizes facial landmarks to enhance dynamic feature detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Guided Video Vision Transformer
Kronecker temporal attention
Spatiotemporal feature fusion
Jingyi Yang
University of Science and Technology of China
Computer Vision · Deep Learning · AI Agent · Generative Models · Reinforcement Learning
Zitong Yu
U.S. Food and Drug Administration
Medical imaging · Deep learning · Machine learning · Image reconstruction
Xiuming Ni
Anhui Tsinglink Information Technology Co., Ltd.
Jia He
Anhui Tsinglink Information Technology Co., Ltd.
Hui Li
Dept. EEIS, University of Science and Technology of China, The CAS Key Laboratory of Wireless-Optical Communications