AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration

📅 2025-09-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Applying DINOv3 to zero-shot anomaly detection (ZSAD) raises two challenges: a domain shift between large-scale pretraining data and anomaly detection tasks, and an inherent bias toward global semantics that hinders localization of fine-grained anomalies. Method: This work adapts DINOv3 to ZSAD for the first time through a vision-language multimodal framework, AD-DINOv3. DINOv3 supplies the image patch tokens and the [CLS] token, while CLIP's text encoder supplies prompt embeddings for the normal and anomalous classes; lightweight adapters in both modalities recalibrate these representations for anomaly detection. An Anomaly-Aware Calibration Module (AACM) further steers the [CLS] token toward local abnormal regions rather than generic foreground semantics, mitigating feature misalignment and semantic masking. Contribution/Results: Experiments on eight industrial and medical benchmarks show that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, supporting its use as a general ZSAD framework for novel categories.

📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, verifying its superiority as a general zero-shot anomaly detection framework.
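
The similarity-based scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the visual tokens and the text embeddings have already been projected to a shared dimension, and the function names, the temperature value of 0.07, and the random toy inputs are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def anomaly_scores(patch_tokens, cls_token, text_embeds, temperature=0.07):
    """Score patches and the whole image against normal/abnormal prompts.

    patch_tokens: (N, D) visual patch embeddings (assumed already in text space)
    cls_token:    (D,)   global image embedding
    text_embeds:  (2, D) row 0 = normal prompt, row 1 = abnormal prompt
    temperature:  assumed value, not taken from the paper
    """
    patches = l2_normalize(patch_tokens)
    cls = l2_normalize(cls_token)
    text = l2_normalize(text_embeds)

    # Cosine similarity to both prompts, turned into probabilities.
    patch_probs = softmax(patches @ text.T / temperature, axis=-1)  # (N, 2)
    image_probs = softmax(cls @ text.T / temperature, axis=-1)      # (2,)

    # The probability of the abnormal prompt is the anomaly score.
    return patch_probs[:, 1], float(image_probs[1])

# Toy inputs: random embeddings stand in for real encoder outputs.
rng = np.random.default_rng(0)
D = 16
text = rng.normal(size=(2, D))
patches = rng.normal(size=(9, D))
patches[4] = text[1]   # one patch is made to match the abnormal prompt
cls = text[0]          # the global token is made to match the normal prompt

patch_map, image_score = anomaly_scores(patches, cls, text)
```

Reshaping `patch_map` to the patch grid (3×3 in this toy case) would give a coarse anomaly map, while `image_score` serves as an image-level decision value.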
Problem

Research questions and friction points this paper is trying to address.

Adapting DINOv3 for zero-shot anomaly detection
Addressing feature misalignment from domain bias
Correcting misinterpretation of subtle anomalies as normal
Innovation

Methods, ideas, or system contributions that make the work stand out.

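Adapts DINOv3 to ZSAD with lightweight adapters in both the visual and text branches for domain alignment
Formulates anomaly detection as multimodal contrastive learning between visual tokens and normal/abnormal text embeddings
Introduces an Anomaly-Aware Calibration Module (AACM) that guides the CLS token toward anomalous regions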
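
The summary does not describe the internal design of the lightweight adapters. As an illustration only, the sketch below shows one common pattern for such modules, a residual bottleneck MLP. The class name, the blend ratio, the hidden width, and the token dimensions (196 tokens of size 768) are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter (not the paper's confirmed design)."""

    def __init__(self, dim, hidden, blend=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Down- and up-projection weights; small init keeps the output near the input.
        self.w_down = rng.normal(scale=0.02, size=(dim, hidden))
        self.w_up = rng.normal(scale=0.02, size=(hidden, dim))
        self.blend = blend  # assumed mixing ratio between adapted and frozen features

    def __call__(self, tokens):
        hidden = np.maximum(tokens @ self.w_down, 0.0)  # ReLU bottleneck
        adapted = hidden @ self.w_up
        # Residual mix: most of the frozen backbone feature is preserved.
        return (1.0 - self.blend) * tokens + self.blend * adapted

    def num_parameters(self):
        return self.w_down.size + self.w_up.size

# Illustrative shapes only: 196 tokens of dimension 768.
adapter = BottleneckAdapter(dim=768, hidden=64)
tokens = np.random.default_rng(1).normal(size=(196, 768))
recalibrated = adapter(tokens)
```

With these assumed sizes the adapter holds 98,304 weights, far fewer than a full transformer backbone, which is the sense in which such a module is "lightweight": only the adapter would be trained while the pretrained encoder stays frozen.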