🤖 AI Summary
This work addresses the trade-off between accuracy and real-time performance in face presentation attack detection, where the high computational cost of optical flow estimation impedes practical deployment. The authors propose an efficient motion-aware approach based on knowledge distillation. A dual-branch teacher model is trained on fused RGB and colorwheel-encoded optical flow inputs, and a logit-level distillation mechanism implicitly transfers its motion-sensitive knowledge to a lightweight RGB-only student model, so no optical flow needs to be computed during inference. The compact student achieves strong performance across multiple benchmarks, including 0.0% HTER on Replay-Attack, while significantly reducing parameters and FLOPs and running in real time at 52 FPS on an NVIDIA Jetson Orin Nano.
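The logit-level distillation mentioned above can be illustrated with a minimal sketch. The summary does not give the exact loss, so this assumes the common temperature-scaled formulation: a KL term between softened teacher and student outputs, blended with cross-entropy on the ground-truth label. The function names, `temperature`, and `alpha` are hypothetical choices for illustration, not values from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Assumed distillation loss for a single sample.

    student_logits: output of the RGB-only student
    teacher_logits: output of the RGB + optical flow teacher
    label:          ground-truth class index (e.g. 0 = attack, 1 = bona fide)
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Standard cross-entropy of the student against the hard label.
    ce = -math.log(softmax(student_logits)[label])
    # T^2 keeps the soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    print(kd_loss([1.5, -0.5], [3.0, -2.0], label=0))
```

Only the student is evaluated at inference under this scheme, so the flow branch adds cost at training time only.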
📝 Abstract
Face presentation attack detection (FacePAD) remains challenging under diverse presentation attacks, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks: 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student performs comparably to or better than the teacher while significantly reducing parameters and FLOPs, and it runs at 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time, resource-constrained FacePAD deployment.
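For readers unfamiliar with the reported metrics, the sketch below shows how HTER and ACER are conventionally computed from classifier scores. It assumes that higher scores mean "bona fide", that the decision threshold has already been chosen (typically on a development set), and that APCER is taken as the worst case over attack types, following the usual ISO/IEC 30107-3 convention. The function names and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def hter(scores, labels, threshold):
    """Half Total Error Rate: mean of FAR and FRR.

    labels: 1 = bona fide, 0 = attack.
    A sample is accepted as bona fide when score >= threshold.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in attacks) / len(attacks)
    frr = sum(s < threshold for s in bona_fide) / len(bona_fide)
    return (far + frr) / 2


def acer(scores, labels, attack_types, threshold):
    """Average Classification Error Rate: mean of APCER and BPCER.

    attack_types: attack category per sample (ignored for bona fide samples).
    APCER is the highest acceptance rate over the attack categories.
    """
    per_type = {}
    for s, y, kind in zip(scores, labels, attack_types):
        if y == 0:
            per_type.setdefault(kind, []).append(s >= threshold)
    apcer = max(sum(hits) / len(hits) for hits in per_type.values())
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    bpcer = sum(s < threshold for s in bona_fide) / len(bona_fide)
    return (apcer + bpcer) / 2


if __name__ == "__main__":
    # Toy scores, purely illustrative.
    scores = [0.9, 0.8, 0.4, 0.3, 0.1, 0.6]
    labels = [1, 1, 1, 0, 0, 0]
    kinds = [None, None, None, "print", "print", "replay"]
    print(hter(scores, labels, 0.5), acer(scores, labels, kinds, 0.5))
```

Because ACER takes the worst attack category rather than pooling all attacks, it can exceed HTER on the same scores when one attack type is much harder than the others.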