SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

📅 2025-05-19
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Fine-grained understanding of surveillance videos is a safety-critical challenge in vision-language research, hindered by real-world complexity, irregular event dynamics, and severe annotation scarcity. To address this, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain, comprising 589K QA pairs across 12 cognitive task categories (e.g., temporal reasoning, causal inference, anomaly comprehension) and covering both normal and anomalous video clips. We propose a hybrid annotation pipeline that integrates temporally aligned human-written captions with LVLM-guided QA generation, and design a multi-dimensional structured evaluation protocol. Evaluating eight state-of-the-art LVLMs under this protocol reveals significant performance gaps, especially in causal and anomaly understanding. SurveillanceVQA-589K establishes a reproducible, extensible evaluation infrastructure for safety-critical vision-language applications.

📝 Abstract
Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual, temporal, and causal comprehension. We evaluate eight LVLMs under this framework, revealing significant performance gaps, especially in causal and anomaly-related tasks, underscoring the limitations of current models in real-world surveillance contexts. Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications such as intelligent monitoring, incident analysis, and autonomous decision-making.
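
To make the annotation pipeline concrete, here is a minimal Python sketch of the LVLM-assisted QA generation step described above: temporally aligned human-written captions are wrapped in a prompt template and sent to an LVLM, which returns open-ended question-answer pairs. Everything here is an illustrative assumption (the `Caption` record, `build_prompt`, the `lvlm_generate` callback, the `|||` delimiter); it is not the authors' released code.

```python
# Hypothetical sketch of the caption-to-QA step of the hybrid annotation
# pipeline. `lvlm_generate` is a stand-in for any LVLM/LLM text API.
from dataclasses import dataclass
from typing import Callable

# Four of the paper's 12 cognitive question categories, for brevity.
QUESTION_TYPES = [
    "temporal reasoning", "causal inference",
    "spatial understanding", "anomaly interpretation",
]

@dataclass
class Caption:
    clip_id: str
    start_s: float  # segment start time, seconds
    end_s: float    # segment end time, seconds
    text: str       # human-written, temporally aligned description

def build_prompt(caption: Caption, qtype: str) -> str:
    """Prompt template asking the model for one QA pair of a given type."""
    return (
        f'Video segment [{caption.start_s:.1f}s-{caption.end_s:.1f}s] is '
        f'described as: "{caption.text}".\n'
        f'Write one open-ended {qtype} question about this segment, '
        f'then its answer, separated by "|||".'
    )

def generate_qa(captions: list[Caption],
                lvlm_generate: Callable[[str], str]) -> list[dict]:
    """Generate one QA pair per (caption, question type) combination."""
    qa_pairs = []
    for cap in captions:
        for qtype in QUESTION_TYPES:
            raw = lvlm_generate(build_prompt(cap, qtype))
            if "|||" not in raw:  # drop malformed generations
                continue
            question, answer = (s.strip() for s in raw.split("|||", 1))
            qa_pairs.append({"clip_id": cap.clip_id, "type": qtype,
                             "question": question, "answer": answer})
    return qa_pairs
```

In the paper's pipeline the generated pairs would presumably also pass quality filtering and cover both normal and anomalous clips; those steps are omitted here.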
Problem

Research questions and friction points this paper is trying to address.

Addressing underexplored surveillance video-language understanding challenges
Creating largest surveillance-specific open-ended video QA benchmark
Evaluating LVLMs' limitations in causal and anomaly comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid annotation pipeline with temporally aligned human-written captions
LVLM-assisted QA generation via prompt-based techniques
Multi-dimensional evaluation protocol for contextual, temporal, and causal comprehension (see the sketch below)
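
As noted in the last item above, the following is a minimal sketch of what a multi-dimensional structured evaluation could look like: each model answer is rated along contextual, temporal, and causal dimensions by an LLM judge, and the scores are averaged per question type. The `judge` callback, the 0-to-1 scale, and the aggregation scheme are assumptions for illustration; the paper's actual rubric may differ.

```python
# Hedged sketch of a multi-dimensional evaluation protocol.
# `judge(prompt) -> float` stands in for any LLM-as-judge call.
from collections import defaultdict
from typing import Callable

DIMENSIONS = ("contextual", "temporal", "causal")

def score_answer(pred: str, ref: str,
                 judge: Callable[[str], float]) -> dict[str, float]:
    """Rate a predicted answer against a reference on each dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = (f"Reference answer: {ref}\nModel answer: {pred}\n"
                  f"Rate the {dim} correctness from 0 to 1.")
        scores[dim] = judge(prompt)
    return scores

def aggregate(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average per-dimension scores within each question type.

    Each result looks like {"type": str, "scores": {dim: float}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for r in results:
        counts[r["type"]] += 1
        for dim, s in r["scores"].items():
            sums[r["type"]][dim] += s
    return {qt: {dim: sums[qt][dim] / counts[qt] for dim in DIMENSIONS}
            for qt in sums}
```

Per-type breakdowns like this are what would surface the causal and anomaly gaps the abstract reports.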
👥 Authors
Bo Liu
Department of Computer Science, Beijing University of Technology
Pengfei Qiao
Department of Computer Science, Beijing University of Technology
Minhan Ma
Department of Computer Science, Beijing University of Technology
Xuange Zhang
Department of Computer Science, Beijing University of Technology
Yinan Tang
Inspur Information (浪潮信息)
Research interests: Data Center Network, Distributed Machine Learning, Machine Learning, NLP, Knowledge Graph
Peng Xu
Department of Electronic Engineering, Tsinghua University
Kun Liu
JD Explore Academy
Tongtong Yuan
Department of Computer Science, Beijing University of Technology