AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video anomaly detection (VAD) methods rely heavily on modeling normal patterns, resulting in poor generalization across unseen scenes and necessitating costly retraining for each new domain. This work introduces Customizable Video Anomaly Detection (C-VAD), a paradigm enabling zero-shot, natural language–driven anomaly detection: given only a textual description of the target anomaly, the model localizes anomalous frames in previously unseen scenes without fine-tuning, retraining, or domain-specific data. The proposed AnyAnomaly model leverages a frozen large vision-language model (LVLM), combining context-aware visual question answering (VQA) with prompt engineering to eliminate dependence on normal-pattern modeling and domain adaptation. Evaluated on newly constructed C-VAD datasets and standard VAD benchmarks, AnyAnomaly achieves state-of-the-art results on UBnormal and demonstrates superior cross-dataset generalization compared to existing methods.

📝 Abstract
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as an abnormal event and detects frames containing the specified event in a video. We implemented AnyAnomaly effectively using context-aware visual question answering, without fine-tuning the large vision-language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.
Problem

Research questions and friction points this paper is trying to address.

Existing VAD models learn scene-specific normal patterns and generalize poorly to new environments.
Retraining or building a separate model per domain demands ML expertise, high-performance hardware, and large-scale data collection.
Anomaly definitions are fixed at training time; users cannot specify which events count as abnormal.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot customizable video anomaly detection
Uses context-aware visual question answering
No fine-tuning of large vision language model
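The core idea above can be sketched as frame-level VQA: for each frame, the frozen LVLM is asked whether the user-defined event appears, and the yes-probability serves as the anomaly score. The sketch below is illustrative only; `query_lvlm` is a hypothetical stub standing in for a real LVLM call, and the prompt template and threshold are assumptions, not the paper's exact design.

```python
# Zero-shot, text-driven anomaly scoring via frame-level VQA (minimal sketch).
# `query_lvlm` is a HYPOTHETICAL stand-in: a real system would send the frame
# and question to a frozen LVLM and read off P("yes" | frame, question).

def query_lvlm(frame, question):
    # Stub scorer: frames are dicts with precomputed content tags, and the
    # "model" answers yes iff the queried event text matches a tag.
    event = question.split("'")[1]  # recover the event from the template below
    return 1.0 if event in frame["tags"] else 0.0

def detect_anomalies(frames, event_text, threshold=0.5):
    """Return indices of frames matching a user-defined abnormal event.

    No retraining: the anomaly is specified purely by `event_text`.
    """
    question = f"Does this scene contain '{event_text}'? Answer yes or no."
    scores = [query_lvlm(frame, question) for frame in frames]
    return [i for i, score in enumerate(scores) if score >= threshold]

frames = [{"tags": ["walking"]}, {"tags": ["bicycle"]}, {"tags": ["walking"]}]
print(detect_anomalies(frames, "bicycle"))  # → [1]
```

Swapping the anomaly (e.g. `"bicycle"` for `"running"`) requires only a new text prompt, which is what makes the paradigm customizable without domain-specific retraining.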