MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

📅 2024-09-15
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing face forgery detection methods rely predominantly on the image modality alone and generalize poorly, especially to unseen diffusion-generated faces. To address this, the authors propose MFCLIP, a multi-modal fine-grained CLIP model for generalizable diffusion face forgery detection (DFFD), the first to jointly leverage image, fine-grained noise, and text modalities. Built on the CLIP architecture, MFCLIP introduces a language-guided multi-modal representation learning paradigm. Key contributions include: (1) a fine-grained language encoder (FLE) that extracts global language features from hierarchical text prompts; (2) a multi-modal vision encoder (MVE) that fuses global image forgery embeddings with fine-grained noise patterns extracted from the richest patch; and (3) a plug-and-play sample pair attention (SPA) mechanism that emphasizes relevant negative pairs and suppresses irrelevant ones for more flexible cross-modal alignment. Extensive evaluations across generators, forgery types, and datasets demonstrate significant improvements over state-of-the-art methods in both generalization and interpretability.

📝 Abstract
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities such as fine-grained noise and text are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GANs, but struggle to detect unseen diffusion-synthesized ones. To address these limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image and noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms state-of-the-art methods in settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
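The sample pair attention (SPA) idea described in the abstract — reweighting cross-modal negative pairs so that relevant negatives are emphasized and irrelevant ones suppressed — can be sketched as a weighted CLIP-style contrastive loss. This is a minimal illustration under assumptions, not the paper's implementation: the function name and the softmax-based weighting scheme are hypothetical stand-ins for the authors' learned attention.

```python
import torch
import torch.nn.functional as F

def spa_contrastive_loss(img_emb: torch.Tensor,
                         txt_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style contrastive loss with a simple sample-pair
    attention (SPA) style reweighting of negative pairs.

    Illustrative sketch only: harder (more relevant) negatives get
    larger weights via a softmax over their similarities, while
    irrelevant negatives are suppressed.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    B = logits.size(0)
    eye = torch.eye(B, device=logits.device, dtype=torch.bool)

    # Attention over negatives: softmax of off-diagonal similarities,
    # rescaled so the average negative weight stays ~1.
    neg_logits = logits.masked_fill(eye, float('-inf'))
    neg_weight = F.softmax(neg_logits, dim=1) * (B - 1)

    # Weighted InfoNCE: positives keep weight 1, negatives are reweighted.
    weights = torch.where(eye, torch.ones_like(logits), neg_weight)
    denom = (weights * logits.exp()).sum(dim=1)
    loss = -(logits.diag() - denom.log()).mean()
    return loss
```

With the standard (unweighted) InfoNCE loss recovered when all weights are 1, this shows where a plug-and-play reweighting of cross-modality pairs slots into alignment training.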
Problem

Research questions and friction points this paper is trying to address.

Detect diffusion-synthesized face forgeries robustly and generally
Explore multi-modal forgery traces beyond image patterns
Enhance generalization via language-guided representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal CLIP for fine-grained forgery detection
Language-guided face forgery representation learning
Plug-and-play sample pair attention alignment
Yaning Zhang
Qilu University of Technology (Shandong Academy of Sciences)
Tianyi Wang
Nanyang Technological University, 50 Nanyang Ave, Block N 4, 639798, Singapore
Zitong Yu
U.S. Food and Drug Administration
Medical imaging · Deep learning · Machine learning · Image reconstruction
Zan Gao
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250014, China, and also with the Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, Tianjin, 300384, China
Linlin Shen
Shenzhen University
Deep Learning · Computer Vision · Facial Analysis/Recognition · Medical Image Analysis
Shengyong Chen
Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, Tianjin, 300384, China