FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses key challenges in zero- and few-shot anomaly detection, including the absence of normal samples from the target class, difficulty in fine-grained localization, and imprecise anomaly descriptions, by proposing FiLo++, a framework with two core components. The first, Fused Fine-Grained Descriptions (FusDes), uses large language models to generate class-adaptive, fine-grained anomaly descriptions, combining fixed and learnable prompt templates with runtime prompt filtering and requiring no normal data from the target class. The second, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate pixel-level localization even for irregularly shaped anomaly regions; a position-enhanced patch matching scheme further improves few-shot detection. Extensive experiments on multiple benchmarks show substantial improvements over state-of-the-art methods in both detection accuracy and localization robustness under zero- and few-shot settings.
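
The summary describes detection by comparing an image against normal and abnormal text descriptions in a shared vision-language embedding space. A minimal sketch of that general recipe, using open_clip and hand-written prompts as stand-ins for LLM-generated fine-grained descriptions, might look like the following; model choice, prompts, and the scoring rule are illustrative assumptions, not the FiLo++ implementation.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP backbone; the model name and prompts below are illustrative assumptions.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

def anomaly_score(image: Image.Image, class_name: str,
                  normal_prompts, abnormal_prompts) -> float:
    """Image-level anomaly score from image-text similarity (higher = more anomalous)."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(image).unsqueeze(0))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

        prompts = [p.format(class_name) for p in normal_prompts + abnormal_prompts]
        txt_feat = model.encode_text(tokenizer(prompts))
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

        sims = (img_feat @ txt_feat.T).squeeze(0)
        normal_sim = sims[: len(normal_prompts)].max()
        abnormal_sim = sims[len(normal_prompts):].max()
        # Softmax over the two hypotheses yields a probability-like anomaly score.
        return torch.softmax(torch.stack([normal_sim, abnormal_sim]), dim=0)[1].item()

# Hypothetical prompts standing in for LLM-generated, class-specific descriptions.
normal_prompts = ["a photo of a flawless {}", "a photo of a {} without any defect"]
abnormal_prompts = ["a photo of a {} with a scratch", "a photo of a broken {}"]
# score = anomaly_score(Image.open("sample.png").convert("RGB"), "bottle",
#                       normal_prompts, abnormal_prompts)
```
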

📝 Abstract
Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
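
The abstract notes that existing zero-shot approaches rely on simple patch-level image-text matching, which is the baseline mechanism DefLoc refines. A hedged sketch of that baseline, with assumed tensor shapes and a per-patch softmax scoring rule, could be:

```python
import torch
import torch.nn.functional as F

def patch_anomaly_map(patch_feats: torch.Tensor,
                      normal_txt: torch.Tensor,
                      abnormal_txt: torch.Tensor,
                      image_size: int) -> torch.Tensor:
    """
    patch_feats  : (N, D) L2-normalized patch embeddings on a sqrt(N) x sqrt(N) grid
    normal_txt   : (K1, D) L2-normalized embeddings of normal descriptions
    abnormal_txt : (K2, D) L2-normalized embeddings of abnormal descriptions
    Returns an (image_size, image_size) anomaly map with values in [0, 1].
    """
    # Each patch keeps its best-matching normal and abnormal description.
    n_sim = (patch_feats @ normal_txt.T).max(dim=-1).values    # (N,)
    a_sim = (patch_feats @ abnormal_txt.T).max(dim=-1).values  # (N,)
    # Per-patch softmax over the two hypotheses gives an anomaly probability.
    score = torch.softmax(torch.stack([n_sim, a_sim], dim=-1), dim=-1)[..., 1]

    side = int(score.numel() ** 0.5)
    amap = score.view(1, 1, side, side)
    # Upsample the patch grid to pixel resolution for a dense anomaly map.
    amap = F.interpolate(amap, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    return amap[0, 0]
```
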
Problem

Research questions and friction points this paper is trying to address.

Anomaly Detection
Zero-shot Learning
Few-shot Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

FiLo++
Grounding DINO
Zero-shot/Few-shot Anomaly Detection
Zhaopeng Gu
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Bingke Zhu
Institute of Automation, Chinese Academy of Sciences
Guibo Zhu
Institute of Automation, Chinese Academy of Sciences
Artificial Intelligence · Computer Vision · Machine Learning
Yingying Chen
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Objecteye Inc., Beijing 100190, China
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; Wuhan AI Research, Wuhan 430073, China; Peng Cheng Laboratory, Shenzhen 518066, China