Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

📅 2026-03-05

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work examines the low zero-shot recall of multimodal large language models (MLLMs) in open-world video anomaly detection, reformulating the task as binary classification under weak temporal supervision. Treating anomaly detection as a language-guided reasoning task, the authors systematically evaluate MLLMs on the ShanghaiTech and CHAD benchmarks. They report a conservative decision bias in zero-shot settings: models disproportionately favor the 'normal' class, which yields high precision but very low recall. Class-specific instructions shift this decision boundary: with 1–3 second video clips as input, the peak F1 score on ShanghaiTech improves from 0.09 to 0.64. Even so, recall remains a critical bottleneck for real-world deployment.
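To make "binary classification under weak temporal supervision" concrete, here is a minimal sketch of one plausible way to turn per-frame annotations into clip-level labels for fixed-length windows. The rule used (a clip is anomalous if any of its frames is) and the names `clip_labels`, `fps`, and `window_seconds` are assumptions for illustration; the summary does not state the paper's actual labeling procedure.

```python
def clip_labels(frame_labels, fps, window_seconds):
    """Collapse per-frame 0/1 labels into one label per fixed-length clip.

    Assumed rule (not taken from the paper): a clip is labeled 1 (anomalous)
    if at least one of its frames is labeled 1, otherwise 0 (normal).
    """
    if fps <= 0 or window_seconds <= 0:
        raise ValueError("fps and window_seconds must be positive")
    frames_per_clip = max(1, int(round(fps * window_seconds)))
    return [
        int(any(frame_labels[start:start + frames_per_clip]))
        for start in range(0, len(frame_labels), frames_per_clip)
    ]


# Toy example: 8 frames at 2 frames per second, one anomalous frame.
toy_frames = [0, 0, 0, 0, 1, 0, 0, 0]
one_second_clips = clip_labels(toy_frames, fps=2, window_seconds=1)
two_second_clips = clip_labels(toy_frames, fps=2, window_seconds=2)
```

Under this assumed rule, longer windows produce fewer clips, and a single anomalous frame marks the whole clip as anomalous, which is one reason window length can interact with precision and recall.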

📝 Abstract

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1–3 s) influence performance, focusing on the precision-recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
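The precision-recall trade-off described above can be illustrated with a short sketch of the standard binary metrics. The labels below are invented toy data, not results from the paper; they only show how a classifier that almost always answers 'normal' can reach perfect precision while its F1 score stays low. The function name `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (illustrative only): 100 clips, 20 of them anomalous.
y_true = [1] * 20 + [0] * 80
# A conservative classifier flags just 2 clips, both correctly.
y_pred = [1] * 2 + [0] * 98

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case precision is 1.0 and recall is 0.1, so F1 is 2/11 (about 0.18): because F1 is the harmonic mean, it is pulled toward the weaker of the two metrics, which is why a recall collapse dominates the score even when every raised alarm is correct.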

Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs

Video Anomaly Detection

Zero-Shot Learning

Recall Collapse

Surveillance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs

Zero-shot Anomaly Detection

Video Understanding