🤖 AI Summary
This study addresses the failure of conventional, appearance-based biometric verification in photorealistic talking-head videos, investigating whether individual facial motion patterns can serve as reliable behavioral biometrics for identity authentication. To this end, we propose a novel "dynamic behavioral fingerprint" paradigm and construct the first benchmark dataset for biometric verification tailored to realistic talking-head videos. We design a lightweight, interpretable spatio-temporal graph convolutional network (ST-GCN) with temporal attention pooling that models the sequential dynamics of facial landmarks. Crucially, the method relies solely on 2D facial keypoints, excluding texture and appearance cues, and achieves nearly 80% AUC on high-fidelity one-shot talking-head videos generated by GAGAvatar. This is the first empirical demonstration that facial motion patterns retain strong discriminative power for identity verification even under deepfake conditions, offering a new pathway for trustworthy authentication of AI-generated content.
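The summary names the key components (a graph convolution over facial landmarks plus temporal attention pooling) without giving the architecture itself. The sketch below is a minimal PyTorch rendering of that combination, assuming 2D landmark tracks shaped `(batch, coords, frames, joints)`; the adjacency handling, layer widths, and single-score attention head are illustrative assumptions, not the authors' released model.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatio-temporal graph convolution block: a graph conv over
    landmark joints followed by a temporal conv over frames."""
    def __init__(self, in_ch, out_ch, adj, t_kernel=9):
        super().__init__()
        self.register_buffer("adj", adj)                 # (V, V) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (N, C, T, V)
        x = self.spatial(x)                              # mix channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.adj)   # aggregate graph neighbors
        return self.relu(self.temporal(x))               # mix across time

class TemporalAttentionPool(nn.Module):
    """Learns a weight per frame and returns the weighted average, so
    frames with distinctive gestures dominate the clip-level embedding."""
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Linear(ch, 1)

    def forward(self, x):                                # x: (N, C, T, V)
        x = x.mean(dim=3).transpose(1, 2)                # pool joints -> (N, T, C)
        w = torch.softmax(self.score(x), dim=1)          # (N, T, 1) frame weights
        return (w * x).sum(dim=1)                        # (N, C) clip embedding

class MotionFingerprint(nn.Module):
    """Landmark-only verifier: stacked ST-GCN blocks + attention pooling."""
    def __init__(self, adj, n_coords=2, emb=128):
        super().__init__()
        self.block1 = STGCNBlock(n_coords, 64, adj)
        self.block2 = STGCNBlock(64, emb, adj)
        self.pool = TemporalAttentionPool(emb)

    def forward(self, x):                                # x: (N, 2, T, V) landmarks
        return self.pool(self.block2(self.block1(x)))
```

Attention pooling here replaces plain temporal averaging, which is one plausible way to keep the model both lightweight and interpretable: the per-frame weights indicate which moments of the clip drove the identity decision.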
📝 Abstract
Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, which makes fraudulent use nearly impossible to detect by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a faithful facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos, containing both genuine and impostor clips, created with a state-of-the-art one-shot avatar generation model, GAGAvatar. We also propose a lightweight, explainable spatio-temporal graph convolutional network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification, with AUC values approaching 80%. The proposed benchmark and biometric system are released to the research community to draw attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
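Since performance is reported threshold-free via AUC, a natural evaluation protocol is to score genuine and impostor clip pairs by embedding similarity and sweep the decision threshold. The self-contained sketch below illustrates that evaluation on synthetic embeddings; the cosine scorer, embedding size, and pair construction are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine(a, b):
    """Cosine similarity between matched rows of two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(0)
emb = 128                                            # embedding size (illustrative)
enroll = rng.normal(size=(8, emb))                   # enrollment-clip embeddings
genuine = enroll + 0.5 * rng.normal(size=(8, emb))   # same-identity probe clips
impostor = rng.normal(size=(8, emb))                 # different-identity probes

# Score genuine pairs (label 1) and impostor pairs (label 0), then
# summarize verification quality with the threshold-free AUC.
scores = np.concatenate([cosine(enroll, genuine), cosine(enroll, impostor)])
labels = np.array([1] * 8 + [0] * 8)
print("AUC:", roc_auc_score(labels, scores))
```

In practice the embeddings would come from a motion model over landmark tracks (e.g., the ST-GCN sketched above), with genuine pairs sharing the same driving person and impostor pairs sharing only the avatar's appearance.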