GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited generalization of existing deepfake attribution and detection methods, which often overlook non-visual cues and the synergy between the two tasks. To bridge this gap, the work introduces human gaze as a novel supervisory signal and proposes a gaze-guided multimodal CLIP framework. The framework integrates a gaze-aware image encoder with an adaptive language-prompt refinement mechanism, enabling fine-grained cross-domain forgery feature extraction and precise vision-language alignment. Evaluated on a newly curated fine-grained benchmark, the proposed method outperforms the current state of the art by 6.56% in attribution accuracy and 5.32% in detection AUC, demonstrating the value of human attention cues for deepfake analysis.
📝 Abstract
Current deepfake attribution and detection works tend to generalize poorly to novel generative methods because they explore the visual modality alone. They typically assess attribution or detection performance on unseen advanced generators only coarsely, and fail to consider the synergy between the two tasks. To this end, we propose a novel gaze-guided CLIP with adaptive-enhanced fine-grained language prompts for fine-grained deepfake attribution and detection (DFAD). Specifically, we construct a novel fine-grained benchmark to evaluate the DFAD performance of networks on novel generators such as diffusion and flow models. Additionally, we introduce a gaze-aware model based on CLIP, devised to enhance generalization to unseen face forgery attacks. Building on the novel observation that pristine and forged gaze vectors differ significantly in distribution, and that facial images generated by GANs and diffusion models preserve the target gaze to very different degrees, we design a visual perception encoder that exploits these inherent gaze differences to mine global forgery embeddings across the appearance and gaze domains. We propose a gaze-aware image encoder (GIE) that fuses forgery gaze prompts, extracted via a gaze encoder, with common forged-image embeddings to capture general attribution patterns, transforming features into a more stable and common DFAD feature space. We build a language refinement encoder (LRE) that generates dynamically enhanced language embeddings via an adaptive-enhanced word selector for precise vision-language matching. Extensive experiments on our benchmark show that our model outperforms the state of the art by 6.56% ACC and 5.32% AUC in average performance under the attribution and detection settings, respectively. Code will be available on GitHub.
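The matching step the abstract describes (a gaze embedding fused with a CLIP image embedding, then compared against language prompts per class) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the embedding dimension, the class set, the fixed mixing coefficient `alpha`, the temperature `tau`, and the random stand-in vectors are all assumptions; in the actual model these would come from the trained gaze encoder, the CLIP encoders, and the LRE's refined prompts.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512        # assumed CLIP-style joint embedding dimension
N_CLASSES = 4  # illustrative classes, e.g. pristine / GAN / diffusion / flow

def l2norm(x, axis=-1):
    """Project embeddings onto the unit hypersphere, as CLIP does."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Random stand-ins for encoder outputs (placeholders, not real features):
gaze_emb = l2norm(rng.standard_normal(D))            # forgery gaze prompt
image_emb = l2norm(rng.standard_normal(D))           # forged-image embedding
text_embs = l2norm(rng.standard_normal((N_CLASSES, D)))  # class prompt embeddings

# Gaze-aware fusion: the paper learns this fusion inside the GIE;
# a fixed convex combination stands in for it here.
alpha = 0.3
fused = l2norm(alpha * gaze_emb + (1 - alpha) * image_emb)

# CLIP-style vision-language matching: temperature-scaled cosine
# similarities against each class prompt, softmaxed over classes.
tau = 0.07
logits = (text_embs @ fused) / tau
probs = np.exp(logits - logits.max())
probs /= probs.sum()

pred = int(np.argmax(probs))  # predicted source class index
```

The fused feature stays on the unit sphere, so the dot products with the normalized text embeddings are cosine similarities, mirroring CLIP's contrastive matching.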
Problem

Research questions and friction points this paper is trying to address.

deepfake attribution
deepfake detection
generalization
novel generative methods
vision-language alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

gaze-guided CLIP
fine-grained language prompt
deepfake attribution
visual-language alignment
forgery generalization
👥 Authors
Yaning Zhang
Qilu University of Technology (Shandong Academy of Sciences)
Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis
Zitong Yu
U.S. Food and Drug Administration
Medical imaging, Deep learning, Machine learning, Image reconstruction
Chunjie Ma
Qilu University of Technology
object detection
Zan Gao
Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250014, China; Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, Tianjin, 300384, China