🤖 AI Summary
This work addresses the challenge of precise anomaly localization and fine-grained semantic interpretation in zero-shot anomaly detection (ZSAD) using multimodal large language models (MLLMs), particularly under data-scarce conditions. We propose Anomaly-OneVision—the first unified framework for ZSAD and interpretable reasoning in data-limited settings. Methodologically, we introduce the novel Look-Twice Feature Matching mechanism, which integrates adaptive anomaly token enhancement, cross-modal alignment, and feature reweighting. To support training and evaluation, we construct Anomaly-Instruct-125k—the first visual instruction tuning dataset for anomaly detection—and the dedicated benchmark VisA-D&R. Experiments demonstrate that Anomaly-OneVision significantly outperforms general-purpose MLLMs (e.g., GPT-4o) on VisA-D&R, achieving state-of-the-art performance in both detection accuracy and the quality of its reasoning descriptions. Furthermore, the framework generalizes well to medical imaging and 3D anomaly detection tasks.
📝 Abstract
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting, which requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, reasoning about image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD and reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. Project page: https://xujiacong.github.io/Anomaly-OV/
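The abstract does not detail how LTFM "adaptively selects and emphasizes abnormal visual tokens," but the general idea of scoring visual tokens against an anomaly-related text embedding and reweighting them can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name, tensor shapes, and the temperature value are all assumptions.

```python
# Illustrative sketch only (NOT Anomaly-OV's actual LTFM implementation):
# score each visual token against an "anomaly" text embedding, then use
# softmax weights to emphasize likely-abnormal tokens in the pooled feature.
import numpy as np

def anomaly_token_reweight(visual_tokens: np.ndarray,
                           anomaly_text_emb: np.ndarray,
                           temperature: float = 0.07) -> np.ndarray:
    """Toy feature-matching reweighting (shapes are assumptions).

    visual_tokens:    (N, D) patch/token features from a vision encoder
    anomaly_text_emb: (D,)   text embedding for an anomaly prompt
    Returns a (D,) pooled feature dominated by anomaly-matching tokens.
    """
    # Cosine similarity between each visual token and the anomaly prompt.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = anomaly_text_emb / np.linalg.norm(anomaly_text_emb)
    scores = v @ t                                   # (N,) match scores
    # Softmax with a sharp temperature -> high-scoring tokens dominate.
    w = np.exp((scores - scores.max()) / temperature)
    w /= w.sum()
    # Anomaly-weighted pooling of the original (unnormalized) tokens.
    return (w[:, None] * visual_tokens).sum(axis=0)

# Usage: if one token closely matches the anomaly prompt, the pooled
# feature ends up close to that token rather than the average.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
prompt = tokens[3] + 0.01 * rng.normal(size=16)      # token 3 is "abnormal"
pooled = anomaly_token_reweight(tokens, prompt)
```

The design intuition mirrors the "look twice" description in the abstract: a first pass scores every token, and a second pass re-attends to the image with those scores, so fine-grained anomalous regions are not washed out by global average pooling.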