VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Anomaly Understanding (VAU) suffers from weak fine-grained spatiotemporal perception, insufficient causal reasoning capability, and a lack of interpretable evaluation benchmarks. To address these challenges, we propose the first reinforcement learning (RL)-driven multimodal large language model (MLLM) fine-tuning framework tailored for VAU, integrating video-text joint modeling, spatiotemporal grounding, and chain-of-thought (CoT) prompting. We further introduce VAU-Bench—the first dedicated chain-reasoning benchmark for VAU—comprising multiple-choice question answering, temporal localization, and attribution description tasks. Our method substantially improves anomaly question-answering accuracy, temporal localization precision, and reasoning consistency, enabling interpretable and robust VAU across diverse scenarios. Key contributions include: (1) a scalable RL-based MLLM fine-tuning paradigm; (2) the first multi-dimensional VAU reasoning evaluation benchmark; and (3) a unified reasoning architecture jointly incorporating causal reasoning and fine-grained spatiotemporal understanding.
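The summary describes fine-tuning driven by verifiable rewards over the benchmark's tasks (multiple-choice QA and temporal localization). A minimal sketch of what such a combined reward could look like is below; the function names, weights, and exact reward composition are illustrative assumptions, not the paper's actual design.

```python
# Hedged sketch: a verifiable reward of the kind used in reinforcement
# fine-tuning for VAU. Helper names and the QA/localization weighting
# are hypothetical; the paper's reward design may differ.

def temporal_iou(pred, gt):
    """Intersection-over-union of two [start, end] intervals (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def vau_reward(pred_answer, gt_answer, pred_span, gt_span,
               w_qa=0.5, w_loc=0.5):
    """Combine multiple-choice correctness with temporal-grounding IoU."""
    r_qa = 1.0 if pred_answer == gt_answer else 0.0
    r_loc = temporal_iou(pred_span, gt_span)
    return w_qa * r_qa + w_loc * r_loc
```

Rewards like this are directly checkable against ground truth, which is what makes RL-style fine-tuning feasible without a learned reward model.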

📝 Abstract
Video Anomaly Understanding (VAU) is essential for applications such as smart cities, security surveillance, and disaster alert systems, yet remains challenging due to its demand for fine-grained spatio-temporal perception and robust reasoning under ambiguity. Despite advances in anomaly detection, existing methods often lack interpretability and struggle to capture the causal and contextual aspects of abnormal events. This limitation is further compounded by the absence of comprehensive benchmarks for evaluating reasoning ability in anomaly scenarios. To address both challenges, we introduce VAU-R1, a data-efficient framework built upon Multimodal Large Language Models (MLLMs), which enhances anomaly reasoning through Reinforcement Fine-Tuning (RFT). In addition, we propose VAU-Bench, the first Chain-of-Thought benchmark tailored for video anomaly reasoning, featuring multiple-choice QA, detailed rationales, temporal annotations, and descriptive captions. Empirical results show that VAU-R1 significantly improves question-answering accuracy, temporal grounding, and reasoning coherence across diverse contexts. Together, our method and benchmark establish a strong foundation for interpretable and reasoning-aware video anomaly understanding. Our code is available at https://github.com/GVCLab/VAU-R1.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video anomaly reasoning with Reinforcement Fine-Tuning
Addressing the lack of interpretability in existing anomaly detection methods
Introducing the first benchmark for evaluating video anomaly reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Fine-Tuning enhances anomaly reasoning
Multimodal Large Language Models improve interpretability
VAU-Bench provides a comprehensive evaluation of anomaly reasoning