Driving Video Retrieval for Complex Queries with Structured Grounding

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to retrieve complex dynamic events in autonomous driving videos, such as cut-ins and hard braking. Vision-language and keyword-based retrieval miss events whose motion is not explicitly described in text, while hand-written or generated rules are brittle when their assumptions do not match real driving data. This work proposes STRIVE-D, a data-calibrated retrieval framework. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapts rules that mismatch the observed data, and fuses the calibrated rule scores with signals from vision-language models and keyword matching. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D achieves up to an 84% relative improvement in top-1 accuracy over state-of-the-art methods.

📝 Abstract

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
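
To make the calibrate-then-fuse idea concrete, here is a minimal sketch in Python. It is not the STRIVE-D implementation, and the abstract does not specify the formulas. The sketch assumes that rule reliability is estimated as a smoothed precision on weakly labeled videos and that fusion is a simple reliability-weighted average. All names (`Candidate`, `rule_reliability`, `fuse`, `rank`) and the smoothing parameters are hypothetical, and the rule-adaptation step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    video_id: str
    rule_score: float      # hypothetical rule-based event score in [0, 1]
    vlm_score: float       # hypothetical vision-language similarity in [0, 1]
    keyword_score: float   # hypothetical keyword-match score in [0, 1]


def rule_reliability(rule_hits, weak_labels, prior=0.5, strength=2.0):
    """Smoothed precision of a rule on weakly labeled videos (an assumption).

    rule_hits[i] is True if the rule fired on video i; weak_labels[i] is True
    if the weak label says the event occurs. The prior keeps the estimate
    near 0.5 when the rule fires on very few videos.
    """
    fired = sum(1 for h in rule_hits if h)
    correct = sum(1 for h, y in zip(rule_hits, weak_labels) if h and y)
    return (correct + prior * strength) / (fired + strength)


def fuse(candidate, reliability):
    """Weight the rule score by its reliability; split the rest evenly."""
    rest = (1.0 - reliability) / 2.0
    return (reliability * candidate.rule_score
            + rest * (candidate.vlm_score + candidate.keyword_score))


def rank(candidates, reliability):
    """Return candidates sorted by fused score, best first."""
    return sorted(candidates, key=lambda c: fuse(c, reliability), reverse=True)
```

Under these assumptions, a rule that agrees with the weak labels dominates the ranking, while a rule that rarely agrees is discounted in favor of the vision-language and keyword signals.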

Problem

Research questions and friction points this paper is trying to address.

video retrieval

autonomous driving

dynamic events

rule-based retrieval

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

video retrieval

rule calibration

autonomous driving