🤖 AI Summary
To address the lack of interpretability and cross-domain generalization in morphing attack detection (MAD) for face recognition systems, this paper proposes the first image–text multimodal zero-shot learning framework for the task. Methodologically, it leverages a pre-trained CLIP model with ten semantically explicit short- and long-form textual prompts to enable prompt-based zero-shot classification and interpretable text generation, without any fine-tuning. Crucially, it introduces human-readable textual explanations into MAD for the first time, establishing a direct mapping between visual anomalies and natural-language semantic descriptions. Evaluated on a newly constructed benchmark covering five state-of-the-art morphing algorithms and three imaging media, the approach significantly outperforms supervised baselines, demonstrating strong robustness and generalization across morphing techniques and acquisition media. This work establishes a novel paradigm for trustworthy face verification grounded in interpretable, zero-shot multimodal reasoning.
📝 Abstract
Morphing attack detection has become an essential component of face recognition systems for ensuring reliable verification. In this paper, we present a multimodal learning approach that accompanies morphing attack detection with a textual description. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) yields not only generalizable morphing attack detection but also the prediction of the most relevant text snippet. We present an extensive analysis of ten different textual prompts, covering both short and long forms, engineered to be human-understandable textual snippets. Extensive experiments were performed on a face morphing dataset developed from a publicly available face biometric dataset. We evaluate state-of-the-art (SOTA) pre-trained neural networks together with the proposed framework in the zero-shot detection of five different morphing generation techniques captured in three different media.
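The prompt-based zero-shot classification described above can be sketched as follows. This is a minimal illustration of the CLIP-style scoring mechanics, assuming image and prompt embeddings have already been produced by a CLIP encoder; the two prompt strings, the embedding dimension, and the temperature value are illustrative stand-ins, not the paper's actual ten prompts or settings.

```python
import numpy as np

# Hypothetical prompts; the paper uses ten engineered short/long prompts.
PROMPTS = [
    "a photo of a bona fide face",
    "a photo of a morphed face",
]

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot scoring: L2-normalize the embeddings,
    take temperature-scaled cosine similarities as logits (one per
    prompt), and softmax them into class probabilities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Random stand-in embeddings; real ones come from a CLIP image/text encoder.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(len(PROMPTS), 512))

probs = zero_shot_scores(image_emb, text_embs)
best_prompt = PROMPTS[int(np.argmax(probs))]
```

The predicted class follows from the highest-probability prompt, which is also what makes the decision interpretable: the winning prompt doubles as the textual explanation.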