AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation

📅 2025-07-26

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Existing zero-shot visual anomaly detection methods rely heavily on large-scale annotated data and often overlook localized anomaly features when directly employing CLIP. To address these limitations, we propose Anomaly-Focused CLIP Adaptation (AF-CLIP), the first framework to jointly optimize class-level classification and patch-level localization. AF-CLIP introduces lightweight visual adapters to enhance local feature representation, designs a multi-scale spatial aggregation mechanism to fuse global and local contextual information, and employs learnable text prompts to explicitly model semantic distinctions between “normal” and “abnormal” concepts. Furthermore, it integrates a composite loss and a memory bank expansion strategy to ensure both zero-shot robustness and few-shot adaptability. Evaluated across multiple industrial and medical benchmark datasets, AF-CLIP achieves an average 12.3% improvement in F1-score, establishing new state-of-the-art performance in zero-shot anomaly detection. The code is publicly available.

Technology Category

Application Category

📝 Abstract

Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.

Problem

Research questions and friction points this paper is trying to address.

Enhancing CLIP for zero-shot anomaly detection

Optimizing visual features for local defects

Improving detection accuracy with multi-scale aggregation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight adapter for anomaly-focused visual features

Multi-scale spatial aggregation for defect detection

Learnable prompts for normal-abnormal state characterization

🔎 Similar Papers

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection