AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation

πŸ“… 2025-07-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing visual anomaly detection methods rely heavily on large-scale annotated data, and zero-shot approaches that directly employ CLIP often overlook localized anomaly features. To address these limitations, we propose Anomaly-Focused CLIP Adaptation (AF-CLIP), a framework that jointly optimizes class-level features for classification and patch-level features for localization. AF-CLIP introduces lightweight visual adapters to enhance local feature representation, designs a multi-scale spatial aggregation mechanism to fuse global and local contextual information, and employs learnable text prompts to explicitly model the semantic distinction between β€œnormal” and β€œabnormal” concepts. Furthermore, it integrates a composite loss and a memory-bank extension to provide both zero-shot robustness and few-shot adaptability. Evaluated across multiple industrial and medical benchmark datasets, AF-CLIP achieves an average 12.3% improvement in F1-score, establishing new state-of-the-art performance in zero-shot anomaly detection. The code is publicly available.

πŸ“ Abstract
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP), which dramatically enhances CLIP's visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by incorporating extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.
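The pipeline the abstract describes (multi-scale neighborhood aggregation, then a lightweight adapter, then scoring patches against normal/abnormal text embeddings) can be sketched roughly as follows. This is a hypothetical PyTorch approximation for illustration only: the class names `MultiScaleAggregation`, `Adapter`, and `anomaly_map`, the window sizes, and the bottleneck width are all assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAggregation(nn.Module):
    """Average each patch feature over neighborhoods of several window
    sizes, then fuse the scales (a stand-in for the paper's mechanism)."""
    def __init__(self, window_sizes=(1, 3, 5)):
        super().__init__()
        self.window_sizes = window_sizes

    def forward(self, patches):            # patches: (B, H, W, D)
        x = patches.permute(0, 3, 1, 2)    # (B, D, H, W) for pooling
        pooled = [F.avg_pool2d(x, k, stride=1, padding=k // 2)
                  for k in self.window_sizes]
        fused = torch.stack(pooled, dim=0).mean(dim=0)
        return fused.permute(0, 2, 3, 1)   # back to (B, H, W, D)

class Adapter(nn.Module):
    """Lightweight bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))

def anomaly_map(patches, text_normal, text_abnormal):
    """Cosine similarity of each patch against 'normal'/'abnormal' text
    embeddings, softmaxed into a per-patch abnormal-probability map."""
    p = F.normalize(patches, dim=-1)
    t = F.normalize(torch.stack([text_normal, text_abnormal]), dim=-1)
    logits = p @ t.t()                     # (B, H, W, 2)
    return logits.softmax(dim=-1)[..., 1]  # (B, H, W)
```

In this sketch the adapter and the text embeddings would be the learnable parts trained on auxiliary data, while the CLIP backbone producing `patches` stays frozen, which is consistent with the "lightweight adapter" framing in the abstract.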
Problem

Research questions and friction points this paper is trying to address.

Enhancing CLIP for zero-shot anomaly detection
Optimizing visual features for local defects
Improving detection accuracy with multi-scale aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight adapter for anomaly-focused visual features
Multi-scale spatial aggregation for defect detection
Learnable prompts for normal-abnormal state characterization
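For the few-shot extension, the abstract mentions extra memory banks. A generic version of that idea scores each query patch by its distance to the nearest stored normal patch feature; the function below is a minimal sketch of that scheme, and its name and the exact scoring rule are assumptions rather than AF-CLIP's specific memory-bank strategy.

```python
import torch
import torch.nn.functional as F

def memory_bank_score(query_patches, bank):
    """Few-shot anomaly score per query patch: one minus the cosine
    similarity to its nearest neighbor in a bank of normal patch
    features. Higher scores indicate more anomalous patches."""
    q = F.normalize(query_patches, dim=-1)  # (N, D) query patch features
    b = F.normalize(bank, dim=-1)           # (M, D) stored normal features
    sim = q @ b.t()                         # (N, M) pairwise similarities
    return 1.0 - sim.max(dim=1).values      # distance to nearest normal
```

A patch identical to a bank entry scores 0, while one far from every stored normal feature scores close to 1, so this score can complement the zero-shot text-based map when a few normal reference images are available.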
Qingqing Fang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Wenxi Lv
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Qinliang Su
School of Computer Science and Engineering, Sun Yat-sen University
Machine Learning Β· Deep Learning Β· Natural Language Processing