🤖 AI Summary
Industrial anomaly detection (AD) faces a labeling bottleneck: most methods train only on defect-free samples, because exploiting the defective samples collected during production requires pixel-level annotations that are too labor-intensive to scale. This paper introduces ADClick, an interactive image segmentation (IIS) algorithm for industrial AD that generates pixel-level anomaly masks from only a few user clicks (e.g., 1–3 points) and a brief textual description. AD models trained with these annotations reach 96.1% AP on the MVTec AD benchmark in the single-class setting. The paper further introduces ADClick-Seg, a cross-modal framework that aligns visual features with textual prompts through a prototype-based approach, combining pixel-level priors with language-guided cues for anomaly detection and localization. On the multi-class MVTec AD task, ADClick-Seg achieves state-of-the-art results: 80.0% AP, 97.5% PRO, and 99.1% Pixel-AUROC.
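For readers unfamiliar with the pixel-level AP metric quoted above, the following is a minimal sketch of how average precision over pixels can be computed from an anomaly score map and a ground-truth mask. It is an illustration only, not the paper's evaluation code: the function name `pixel_average_precision` is hypothetical, and ties between scores are broken by sort order rather than grouped per threshold, so results can differ slightly from library implementations when many pixels share a score.

```python
import numpy as np


def pixel_average_precision(scores, masks):
    """Average precision over all pixels (hypothetical helper).

    scores: array of anomaly scores, higher means more anomalous.
    masks:  array of the same shape, nonzero where a pixel is defective.
    """
    s = np.asarray(scores, dtype=float).ravel()
    y = np.asarray(masks).ravel().astype(bool)
    n_pos = int(y.sum())
    if n_pos == 0:
        raise ValueError("need at least one anomalous pixel")

    # Rank pixels from most to least anomalous.
    order = np.argsort(-s, kind="stable")
    hits = y[order]

    # Precision after each ranked pixel.
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, hits.size + 1)

    # AP is the mean precision taken at the rank of each defective pixel.
    return float(precision[hits].sum() / n_pos)
```

A perfect ranking (every defective pixel scored above every normal pixel) gives an AP of 1.0; each normal pixel ranked above a defective one lowers it.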
📝 Abstract
Industrial product inspection is often performed using Anomaly Detection (AD) frameworks trained solely on non-defective samples. Although defective samples can be collected during production, leveraging them usually requires pixel-level annotations, limiting scalability. To address this, we propose ADClick, an Interactive Image Segmentation (IIS) algorithm for industrial anomaly detection. ADClick generates pixel-wise anomaly annotations from only a few user clicks and a brief textual description, enabling precise and efficient labeling that significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD). We further introduce ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach for anomaly detection and localization. By combining pixel-level priors with language-guided cues, ADClick-Seg achieves state-of-the-art results on the challenging "Multi-class" AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD).
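The abstract does not spell out how clicks or prototypes are represented, so the sketch below only illustrates the two general ideas under stated assumptions. It assumes clicks are encoded as a Gaussian heatmap (a common choice in interactive segmentation, not confirmed as the authors' choice) and that per-pixel features are compared with a set of prototype vectors by cosine similarity. The names `click_prior_map` and `prototype_similarity`, the `sigma` parameter, and the array shapes are all hypothetical.

```python
import numpy as np


def click_prior_map(shape, clicks, sigma=10.0):
    """Encode user clicks as a pixel-level prior (assumed Gaussian encoding).

    shape:  (height, width) of the image.
    clicks: iterable of (row, col) click positions.
    Returns an (height, width) map whose value is 1.0 at each click
    and decays with distance from the nearest click.
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    prior = np.zeros((h, w), dtype=float)
    for r, c in clicks:
        dist_sq = (rows - r) ** 2 + (cols - c) ** 2
        prior = np.maximum(prior, np.exp(-dist_sq / (2.0 * sigma ** 2)))
    return prior


def prototype_similarity(features, prototypes, eps=1e-8):
    """Best cosine similarity between each pixel feature and any prototype.

    features:   (height, width, dim) per-pixel feature vectors.
    prototypes: (num_prototypes, dim) prototype vectors.
    Returns an (height, width) map of maximum cosine similarities.
    """
    features = np.asarray(features, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + eps)
    similarity = f @ p.T  # shape (height, width, num_prototypes)
    return similarity.max(axis=-1)
```

In a pipeline of this kind, the prior map would typically be fed to the segmentation network alongside the image, while the similarity map indicates how closely each location matches the prototypes; how ADClick-Seg actually combines these signals is described in the full paper, not here.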