🤖 AI Summary
Face anti-spoofing (FAS) faces two key challenges: insufficient modeling of attack semantics and redundant cross-domain training. To address these, we propose the first instruction-guided vision-language unified framework for FAS. Our method decouples “content instructions”—explicitly encoding spoofing-type semantics—from “style instructions”—implicitly capturing environmental domain shifts—and integrates a meta-domain adaptation strategy to achieve strong cross-domain generalization under single-domain training, eliminating the need for costly multi-domain retraining. Leveraging vision-language pre-trained models, we introduce text prompt engineering to steer discriminative feature learning. Evaluated on multiple standard benchmarks, our approach consistently surpasses state-of-the-art methods using significantly fewer training resources, achieving substantial accuracy gains. This demonstrates a principled unification of semantic interpretability and training efficiency in FAS.
📝 Abstract
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. We propose InstructFLIP, a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance, trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components: content-based instructions focus on the essential semantics of spoofing, while style-based instructions capture variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP, which outperforms SOTA models in accuracy while substantially reducing training redundancy across diverse domains in FAS. The project website is available at https://kunkunlin1221.github.io/InstructFLIP.
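To make the content/style decoupling concrete, here is a minimal illustrative sketch of how the two instruction types might be templated before being fed to a VLM text encoder. All names below (`SPOOF_TYPES`, `build_content_prompt`, the attack and environment categories) are assumptions for illustration, not the paper's actual prompt sets or code.

```python
# Hypothetical sketch: decoupled content vs. style instructions for FAS.
# Content instructions encode spoofing-type semantics; style instructions
# capture environment and camera variation, as described in the abstract.

SPOOF_TYPES = ["real face", "print attack", "replay attack", "3D mask"]
ENVIRONMENTS = ["indoor lighting", "outdoor lighting"]
CAMERAS = ["low-resolution webcam", "high-resolution camera"]


def build_content_prompt(spoof_type: str) -> str:
    """Content instruction: explicitly encodes the attack-type semantics."""
    return f"This is a photo of a {spoof_type}."


def build_style_prompt(environment: str, camera: str) -> str:
    """Style instruction: implicitly describes domain shift (scene + sensor)."""
    return f"A face image captured under {environment} with a {camera}."


# One content prompt per attack category, one style prompt per
# (environment, camera) combination; a VLM text encoder would embed these
# to guide discriminative feature learning.
content_prompts = [build_content_prompt(t) for t in SPOOF_TYPES]
style_prompts = [
    build_style_prompt(e, c) for e in ENVIRONMENTS for c in CAMERAS
]
```

In this toy setup the content prompts carry the label-relevant semantics, while the style prompts enumerate nuisance factors, mirroring the explicit/implicit split the framework describes.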