Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing semi-supervised video anomaly detection methods struggle to identify complex anomalies involving multi-object interactions and lack interpretability. To address this, we propose the first explainable interaction anomaly detection framework grounded in multimodal large language models (MLLMs). Our method uses an MLLM to generate high-level semantic activity descriptions from temporal visual inputs of object pairs in videos, constructing behavioral semantic trajectories; anomalies are then detected by temporal similarity comparison of the text embeddings derived from these descriptions. This work pioneers the integration of MLLMs into video interaction modeling, combining strong interpretability (natural-language attributions) with high detection accuracy. Evaluated on multiple benchmark datasets, the approach achieves state-of-the-art performance and markedly improves detection of interaction-centric anomalies such as physical conflicts and erroneous collaboration.

📝 Abstract
Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.
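As a rough illustration of the description-generation step the abstract describes, the sketch below constructs a natural-language query for one object pair over a time window; such a query would be sent to an MLLM together with the pair's visual crops. The prompt wording, function name, and object labels are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: build a text query for an MLLM about one object
# pair in a time window. In the paper's pipeline the MLLM's answer would
# become one entry in the pair's behavioral description sequence.
def build_pair_prompt(obj_a: str, obj_b: str, t_start: float, t_end: float) -> str:
    return (
        f"Between t={t_start:.1f}s and t={t_end:.1f}s, describe in one "
        f"sentence the activity of the {obj_a} and the {obj_b}, and any "
        f"interaction between them."
    )

print(build_pair_prompt("person", "bicycle", 0.0, 2.0))
```

Issuing this query at successive time windows yields a sequence of textual descriptions per object pair, which serves as the high-level activity representation the abstract refers to.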
Problem

Research questions and friction points this paper is trying to address.

Detecting complex object interaction anomalies in videos
Providing explainable video anomaly detection through activity descriptions
Enhancing traditional methods with multimodal language model interpretations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLM to describe object activity and interactions
Generates textual descriptions from nominal video pairs
Compares test descriptions to nominal ones for detection
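The detection step above can be sketched as follows, assuming each description is embedded with some text encoder and a test description is scored by its dissimilarity to the closest nominal description. The toy hash-seeded embedding stands in for a real sentence encoder, and the scoring rule (one minus the maximum cosine similarity) is an illustrative choice, not necessarily the paper's exact formulation.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a text encoder: deterministic unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def anomaly_score(test_desc: str, nominal_descs: list[str]) -> float:
    """1 - max cosine similarity between the test description and any
    nominal (training-time) description; higher means more anomalous."""
    test = embed(test_desc)
    return 1.0 - max(float(test @ embed(d)) for d in nominal_descs)

nominal = ["two people walking together", "a person riding a bicycle"]
print(anomaly_score("two people walking together", nominal))   # near 0
print(anomaly_score("a person punching another person", nominal))
```

Because the score is computed over natural-language descriptions, the nearest nominal description (and the test description itself) double as a human-readable explanation of why a segment was or was not flagged.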