🤖 AI Summary
Automated accident detection and understanding in 24/7 traffic surveillance video remains challenging due to spatiotemporal complexity and semantic ambiguity.
Method: This paper proposes SeeUnsafe, a multimodal large language model (MLLM)-driven accident analysis framework. It introduces a severity-aware dynamic video segmentation and aggregation strategy, integrated with visual grounding and structured multimodal prompting, to enable fine-grained accident localization and interactive natural-language interpretation. We further propose the Information Matching Score (IMS), an interpretable evaluation metric that measures structured alignment between model outputs and ground-truth annotations.
Results: Evaluated on the Toyota Woven Traffic Safety dataset, SeeUnsafe achieves significant improvements in accident classification and visual localization accuracy. It supports zero-shot transfer and user-defined queries without manual post-processing. The framework establishes a new paradigm for intelligent traffic monitoring that is efficient, interpretable, and scalable.
📝 Abstract
The increasing availability of traffic videos operating on a 24/7/365 basis has great potential to increase the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras under a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, and require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow into a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustment to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and to enable fine-grained visual grounding. We introduce the Information Matching Score (IMS), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at https://github.com/ai4ce/SeeUnsafe.
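To make the IMS idea concrete, here is a minimal sketch of a field-wise matching score over structured responses. The field names, the equal-weight average, and the string-comparison judge are all illustrative assumptions; the paper's actual metric uses an MLLM as the judge rather than exact string matching.

```python
# Hypothetical sketch of an Information Matching Score (IMS):
# compare a model's structured accident report against ground-truth
# annotations field by field and average the per-field match scores.

def field_match(predicted: str, truth: str) -> float:
    """Toy judge: 1.0 on exact (case-insensitive) match, else 0.0.
    SeeUnsafe instead queries an MLLM for semantic agreement."""
    return 1.0 if predicted.strip().lower() == truth.strip().lower() else 0.0

def information_matching_score(response: dict, ground_truth: dict) -> float:
    """Average match over the fields present in the ground truth."""
    if not ground_truth:
        return 0.0
    scores = [
        field_match(response.get(field, ""), value)
        for field, value in ground_truth.items()
    ]
    return sum(scores) / len(scores)

# Example structured output vs. annotation (fields are illustrative):
pred = {"accident": "yes", "severity": "minor", "agents": "two sedans"}
gt = {"accident": "yes", "severity": "minor", "agents": "two cars"}
print(information_matching_score(pred, gt))  # 2 of 3 fields match -> 0.667
```

An MLLM-based judge would replace `field_match` with a semantic-agreement query, so that paraphrases like "two sedans" and "two cars" could still count as a match.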