🤖 AI Summary
Video anomaly detection (VAD) faces critical challenges in the large model era, including scarce labeled data, insufficient multimodal fusion, fragmented architectures, and ambiguous task objectives. Method: This paper, presented as the first comprehensive survey of VAD methods based on multimodal large language models (MLLMs) and large language models (LLMs), proposes a unified framework that covers both deep neural network (DNN)-based and large-model-based approaches. It systematically analyzes technical evolution across four dimensions (data annotation paradigms, input modality combinations, model architecture design, and task objective formalization) and establishes a taxonomy encompassing both classical and large-model-based approaches. Contribution/Results: The study identifies a paradigm shift from unimodal supervised learning to multimodal collaborative reasoning, and from end-to-end discriminative modeling to semantics-driven understanding. It pinpoints key bottlenecks, including feature alignment and joint temporal-semantic modeling, and provides a systematic roadmap for algorithmic innovation in intelligent surveillance and public safety applications.
📝 Abstract
Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos, serving as a core technology in intelligent surveillance and public safety. With the advancement of deep learning, the continuous evolution of deep model architectures has driven innovation in VAD methodologies, significantly enhancing feature representation and scene adaptability, thereby improving algorithm generalization and expanding application boundaries. More importantly, the rapid development of multimodal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field. With the support of MLLMs and LLMs, VAD has undergone significant transformations in data annotation, input modalities, model architectures, and task objectives. The surge in publications and the evolution of tasks have created an urgent need for systematic reviews of recent advancements. This paper presents the first comprehensive survey analyzing VAD methods based on MLLMs and LLMs, providing an in-depth discussion of the changes occurring in the VAD field in the era of large models and their underlying causes. Additionally, this paper proposes a unified framework that encompasses both deep neural network (DNN)-based and LLM-based VAD methods, offering a thorough analysis of the new VAD paradigms empowered by LLMs, constructing a classification system, and comparing the strengths and weaknesses of these paradigms. Building on this foundation, this paper focuses on current VAD methods based on MLLMs/LLMs. Finally, based on the trajectory of technological advancements and existing bottlenecks, this paper distills key challenges and outlines future research directions, offering guidance for the VAD community.