PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video-capable multimodal large language models struggle to deliver precise, low-false-alarm proactive safety warnings on real-world streaming video, and no existing benchmark effectively evaluates this ability. This work proposes the first streaming video evaluation framework tailored for proactive safety warning, introducing a benchmark of 740 videos spanning driving, healthcare, daily-life, and industrial scenarios. Risk videos are annotated with frame-level risk onset and accident boundaries, so that temporal accuracy, content correctness, and false-alarm rate can be assessed jointly. Experiments on 13 multimodal large language models show that no model exceeds 20.0% on the strictest metric. Models perform relatively well in daily-life settings but produce severe false alarms in complex scenarios such as driving. Recall and false-alarm rate are also positively correlated (Pearson r = 0.64), indicating a reliance on superficial activity cues rather than reasoning about emerging risk.

📝 Abstract

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
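The evaluation idea in the abstract (a warning counts only if it lands between risk onset and the accident, and warnings on safe clips count as false positives) can be illustrated with a minimal sketch. This is not the benchmark's actual scoring code: the `Clip` fields, the half-open window `[risk_onset, accident_start)`, and the function names are assumptions made for illustration, and content correctness of the warning text is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Clip:
    """One evaluated clip. All field names are hypothetical."""
    has_risk: bool
    risk_onset: Optional[int] = None      # frame where danger first becomes visible
    accident_start: Optional[int] = None  # frame where the accident begins
    warning_frame: Optional[int] = None   # frame of the model's warning, None if silent


def is_timely(clip: Clip) -> bool:
    """True if a risk clip received a warning inside [risk_onset, accident_start)."""
    if not clip.has_risk or clip.warning_frame is None:
        return False
    # Assumes risk clips always carry both annotations.
    return clip.risk_onset <= clip.warning_frame < clip.accident_start


def timely_recall(clips: Sequence[Clip]) -> float:
    """Fraction of risk clips that received a warning inside the window."""
    risky = [c for c in clips if c.has_risk]
    return sum(is_timely(c) for c in risky) / len(risky) if risky else 0.0


def false_positive_rate(clips: Sequence[Clip]) -> float:
    """Fraction of no-risk clips on which the model issued any warning."""
    safe = [c for c in clips if not c.has_risk]
    return sum(c.warning_frame is not None for c in safe) / len(safe) if safe else 0.0


clips = [
    Clip(True, risk_onset=30, accident_start=90, warning_frame=45),    # inside the window
    Clip(True, risk_onset=30, accident_start=90, warning_frame=10),    # before risk onset
    Clip(True, risk_onset=30, accident_start=90, warning_frame=None),  # missed
    Clip(False, warning_frame=12),                                     # false alarm
    Clip(False),                                                       # correctly silent
]
print(timely_recall(clips), false_positive_rate(clips))
```

Under these assumptions, a model that warns on every clip raises its chance of landing inside the window while driving the false-positive rate toward 1, which is the trade-off the reported recall and false-positive coupling describes.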

Problem

Research questions and friction points this paper is trying to address.

proactive safety warning

streaming video benchmark

false-positive rate

temporal calibration

multimodal large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video benchmark

proactive safety warning

temporal calibration