Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos

📅 2025-10-14

🤖 AI Summary

Autonomous driving systems still struggle to understand accidents and to produce interpretable reports for out-of-distribution (OOD) hazardous scenarios. To address this, we propose a hierarchical vision-language reasoning framework that combines frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). Model ensembling and a Blind A/B Scoring selection protocol further improve the factual accuracy and readability of the generated incident reports. On the official 2COOOL open leaderboard, our approach ranks 2nd among 29 participating teams and achieves the best CIDEr-D score. The implementation is publicly available.

📝 Abstract

Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.
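The three stages named in the abstract (frame-level captioning, incident frame detection, fine-grained reasoning) can be pictured as a simple pipeline. The Python sketch below is illustrative only: the function names, the `IncidentReport` type, the `window` parameter, and the plain callables standing in for VLM calls are all hypothetical and are not taken from the authors' repository. It also assumes that incident frame detection selects the single highest-scoring frame, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IncidentReport:
    captions: List[str]   # one caption per input frame
    incident_frame: int   # index of the frame judged most likely to show the incident
    narrative: str        # free-text report produced by the reasoning stage


def caption_frames(frames: Sequence[str], captioner: Callable[[str], str]) -> List[str]:
    """Stage 1 (hypothetical): describe every frame independently."""
    return [captioner(frame) for frame in frames]


def detect_incident_frame(captions: Sequence[str], scorer: Callable[[str], float]) -> int:
    """Stage 2 (assumed): return the index of the highest-scoring caption.

    Ties resolve to the earliest frame.
    """
    scores = [scorer(caption) for caption in captions]
    return max(range(len(scores)), key=scores.__getitem__)


def build_report(
    frames: Sequence[str],
    captioner: Callable[[str], str],
    scorer: Callable[[str], float],
    reasoner: Callable[[str], str],
    window: int = 1,
) -> IncidentReport:
    """Stage 3 (hypothetical): reason over captions near the incident frame."""
    if not frames:
        raise ValueError("at least one frame is required")
    captions = caption_frames(frames, captioner)
    idx = detect_incident_frame(captions, scorer)
    # Keep only the captions within `window` frames of the detected incident.
    lo, hi = max(0, idx - window), min(len(captions), idx + window + 1)
    context = "\n".join(f"frame {i}: {captions[i]}" for i in range(lo, hi))
    return IncidentReport(captions=captions, incident_frame=idx, narrative=reasoner(context))
```

In a real system the `captioner`, `scorer`, and `reasoner` arguments would wrap VLM calls on image frames; here they are ordinary functions so that the control flow can be exercised without a model.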

Problem

Research questions and friction points this paper is trying to address.

Generating human-interpretable incident reports from dashcam videos

Improving hazard understanding in out-of-distribution (OOD) driving scenarios

Enhancing factual accuracy and readability of accident narratives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical reasoning framework with VLMs

Frame-level captioning and incident detection

Model ensembling with Blind A/B Scoring selection for factual accuracy and readability
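The source does not describe how Blind A/B Scoring works in detail. One plausible reading is a pairwise comparison in which a judge sees two candidate reports without knowing which model produced them, and the candidate with the most pairwise wins is kept. The Python sketch below implements that assumed reading only; the function name `blind_ab_select`, the `Judge` signature, and the win-count rule are hypothetical and may differ from the authors' protocol.

```python
import itertools
import random
from typing import Callable, Dict

# Hypothetical judge: sees two anonymous report texts and answers "A" or "B".
Judge = Callable[[str, str], str]


def blind_ab_select(candidates: Dict[str, str], judge: Judge, seed: int = 0) -> str:
    """Return the name of the candidate with the most pairwise wins.

    `candidates` maps a model name to its generated report. The judge never
    receives the names, and the A/B order of each pair is shuffled so that
    position cannot reveal which model wrote which report.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    rng = random.Random(seed)
    wins = {name: 0 for name in candidates}
    for first, second in itertools.combinations(candidates, 2):
        pair = [first, second]
        rng.shuffle(pair)  # randomize which report is shown as "A"
        verdict = judge(candidates[pair[0]], candidates[pair[1]])
        if verdict not in ("A", "B"):
            raise ValueError(f"judge must return 'A' or 'B', got {verdict!r}")
        wins[pair[0] if verdict == "A" else pair[1]] += 1
    # Ties resolve to the candidate that appears first in `candidates`.
    return max(wins, key=wins.get)
```

The judge could be a human rater or another model; counting pairwise wins is just one simple way to aggregate the verdicts into a single selection.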
