🤖 AI Summary
To address the challenge of automatically identifying hazardous scenarios in unlabeled dashcam videos, this paper proposes a weakly supervised multimodal analysis framework. First, unsupervised acoustic anomaly detection localizes driver stress responses from vocal cues. Second, a weak-rule ensemble calibrated with differential privacy enables zero-shot hazardous-object recognition and scene understanding without manual annotations. Third, a vision-language model generates fine-grained natural-language descriptions of the detected hazards. The approach is the first to unify acoustic anomaly modeling, privacy-preserving weak supervision, and multimodal generation in a single hazard-perception pipeline, and it achieved top scores on all three core tasks of the COOOL 2025 Challenge: hazardous-reaction detection, hazardous-object identification, and scene-description generation. The implementation is publicly available.
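The unsupervised acoustic step above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the short-time-energy z-score heuristic, the frame length, and the threshold are all assumptions standing in for whatever detector the paper actually uses.

```python
import numpy as np

def detect_audio_anomalies(samples: np.ndarray, sr: int = 16000,
                           frame_ms: int = 100, z_thresh: float = 3.0):
    """Flag frames whose short-time energy deviates strongly from the
    clip's own statistics -- a simple unsupervised stand-in for a
    sound-anomaly detector (all details here are assumptions)."""
    frame_len = sr * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)            # per-frame energy
    z = (energy - energy.mean()) / (energy.std() + 1e-8)
    return np.where(z > z_thresh)[0]               # indices of anomalous frames
```

Because the statistics are computed per clip, the detector needs no labels or training data, which matches the zero-annotation setting described in the summary.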
📝 Abstract
This paper presents a novel approach for hazard analysis in dashcam footage, addressing three tasks: detecting driver reactions to hazards, identifying hazardous objects, and generating descriptive captions. We first introduce a method for detecting driver reactions through unsupervised speed and sound anomaly detection. For hazard detection, we employ a set of heuristic rules as weak classifiers and combine them with an ensemble method; the ensemble is further calibrated with differential privacy to mitigate overconfidence, ensuring robustness despite the lack of labeled data. Finally, we use state-of-the-art vision-language models for hazard captioning, generating descriptive labels for the detected hazards. Our method achieved the highest scores in the Challenge on Out-of-Label in Autonomous Driving (COOOL 2025), demonstrating its effectiveness across all three tasks. Source code is publicly available at https://github.com/ffyyytt/COOOL_2025.
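The weak-rule ensemble with differential-privacy calibration might look roughly like the sketch below. Everything here is an assumption for illustration: the per-rule score matrix, the weighted vote, and the Laplace-noise mechanism (a standard DP primitive used here to soften overconfident votes) are not taken from the authors' code.

```python
import numpy as np

def ensemble_with_dp_calibration(rule_scores, rule_weights,
                                 epsilon=1.0, rng=None):
    """Combine per-rule hazard scores (shape: rules x candidates) into a
    single score per candidate, then add Laplace noise of scale 1/epsilon
    before normalizing -- damping overconfidence in the spirit of the
    differential-privacy calibration the abstract describes (hypothetical
    sketch, not the paper's mechanism)."""
    rng = rng or np.random.default_rng()
    combined = rule_weights @ rule_scores          # weighted vote per candidate
    noisy = combined + rng.laplace(0.0, 1.0 / epsilon, size=combined.shape)
    noisy = np.clip(noisy, 0.0, None)              # scores must stay non-negative
    total = noisy.sum()
    if total == 0:                                 # degenerate case: fall back to uniform
        return np.full_like(noisy, 1.0 / len(noisy))
    return noisy / total
```

A smaller `epsilon` injects more noise and thus flattens the distribution more aggressively, which is the usual privacy/utility trade-off; the right setting would depend on how sharply the heuristic rules disagree.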