π€ AI Summary
Existing zero-shot visual anomaly detection methods rely heavily on large-scale annotated data and often overlook localized anomaly features when directly employing CLIP. To address these limitations, we propose Anomaly-Focused CLIP Adaptation (AF-CLIP), the first framework to jointly optimize class-level classification and patch-level localization. AF-CLIP introduces lightweight visual adapters to enhance local feature representation, designs a multi-scale spatial aggregation mechanism to fuse global and local contextual information, and employs learnable text prompts to explicitly model semantic distinctions between βnormalβ and βabnormalβ concepts. Furthermore, it integrates a composite loss and a memory bank expansion strategy to ensure both zero-shot robustness and few-shot adaptability. Evaluated across multiple industrial and medical benchmark datasets, AF-CLIP achieves an average 12.3% improvement in F1-score, establishing new state-of-the-art performance in zero-shot anomaly detection. The code is publicly available.
π Abstract
Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP's zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.