No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing video anomaly detection methods in open-world scenarios, where insufficient dataset diversity and limited contextual semantic understanding hinder performance. To overcome these challenges, we propose LAVIDA, the first end-to-end zero-shot video anomaly detection framework that operates without any real anomalous samples during training. LAVIDA leverages segmented objects from normal videos as pseudo-anomalies and integrates a multimodal large language model to enhance semantic comprehension. Additionally, it introduces a reverse-attention-based token compression strategy that simultaneously improves generalization and reduces computational overhead. Extensive experiments demonstrate that LAVIDA achieves state-of-the-art performance in both frame-level and pixel-level zero-shot anomaly detection across four standard video anomaly detection benchmarks.
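The summary's core idea of using segmented objects from normal videos as pseudo-anomalies can be illustrated with a minimal compositing sketch. This is not the paper's implementation; the function name, paste coordinates, and mask convention are assumptions for illustration, showing only the generic cut-and-paste step that yields a pixel-level pseudo label.

```python
import numpy as np

def paste_pseudo_anomaly(frame, obj_patch, obj_mask, top, left):
    """Composite a segmented object onto a normal frame as a pseudo-anomaly.

    frame:     (H, W, 3) uint8 normal video frame
    obj_patch: (h, w, 3) uint8 crop of a segmented object
    obj_mask:  (h, w) bool segmentation mask of the object
    top, left: paste location of the patch's upper-left corner

    Returns the augmented frame and a (H, W) bool anomaly mask that can
    serve as a pixel-level pseudo label during training.
    """
    out = frame.copy()
    h, w = obj_mask.shape
    # Slicing yields a view into `out`, so masked assignment pastes in place.
    region = out[top:top + h, left:left + w]
    region[obj_mask] = obj_patch[obj_mask]
    # Pixel-level pseudo ground truth: exactly the pasted object pixels.
    anomaly_mask = np.zeros(frame.shape[:2], dtype=bool)
    anomaly_mask[top:top + h, left:left + w] = obj_mask
    return out, anomaly_mask
```

Because the label mask is produced alongside the augmented frame, the same sample supports both frame-level (any pasted pixels present) and pixel-level supervision without any real anomalous data.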

📝 Abstract
The collection and detection of video anomaly data has long been a challenging problem due to the rarity and spatio-temporal scarcity of anomalies. Existing video anomaly detection (VAD) methods underperform in open-world scenarios; key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories, and it further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and reduce computational cost. Training is conducted solely on pseudo-anomalies, without any real VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available at https://github.com/VitaminCreed/LAVIDA.
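The reverse-attention token compression described in the abstract can be sketched as follows. The intuition: tokens that a normality-oriented query attends to weakly are the ones most likely to carry scarce anomalous evidence, so keeping only those both focuses the model and shrinks the sequence. This is a minimal single-query numpy sketch, not the paper's method; the function signature, the use of a single "normal context" query vector, and the top-k selection rule are all assumptions for illustration.

```python
import numpy as np

def reverse_attention_compress(tokens, query, keep_ratio=0.25):
    """Keep the tokens LEAST attended to by a normal-context query.

    tokens: (N, D) array of visual token embeddings
    query:  (D,)  embedding summarizing normal context
    Returns the retained tokens and their (sorted) original indices.
    """
    d = tokens.shape[1]
    # Standard scaled dot-product attention of the query over all tokens.
    scores = tokens @ query / np.sqrt(d)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    # Reverse attention: low weight under the normal query means the token
    # is poorly explained by normal context, hence anomaly-relevant.
    reverse = 1.0 - attn
    k = max(1, int(keep_ratio * len(tokens)))
    keep_idx = np.argsort(reverse)[-k:]  # top-k least-attended tokens
    return tokens[keep_idx], np.sort(keep_idx)
```

One consequence of this selection rule is that the token most aligned with the normal-context query is always discarded whenever the keep ratio is below 1, which is the compression behavior the sketch is meant to demonstrate.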
Problem

Research questions and friction points this paper is trying to address.

video anomaly detection
zero-shot learning
spatio-temporal scarcity
anomalous semantics
open-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot Video Anomaly Detection
Multimodal Large Language Model (MLLM)
Pseudo-anomaly Generation
Token Compression
Anomaly Exposure Sampler
Zunkai Dai
Beijing University of Posts and Telecommunications
Ke Li
Beijing University of Posts and Telecommunications
Jiajia Liu
Ant Group
Jie Yang
Beijing University of Posts and Telecommunications
Yuanyuan Qiao
Beijing University of Posts and Telecommunications