🤖 AI Summary
This work addresses the challenge of training vision-language driving models under limited annotation budgets, where existing approaches rely on costly, free-form reasoning supervision. The authors propose VeriDrive, a framework that constructs planning-oriented, verifiable counterfactual supervision. It introduces a structured Perception-Evaluation-Revision reasoning chain and scales data construction by combining local generation with validator-guided selective correction, escalating only invalid or difficult samples. The resulting auditable intermediate fields and structured revision targets enable lower-cost supervision for vision-language planning. In controlled open-loop experiments on a dataset built on nuScenes and trained under the Omni-Q protocol, VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and paid LLM/VLM cost.
📝 Abstract
Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts will be released after the review process.
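The validator-guided selective correction described above can be illustrated with a minimal sketch. This is not the authors' implementation, which has not been released. The record fields, the `validate` rules, and the `local_generate`, `escalate`, and `is_difficult` callables are all hypothetical names chosen for illustration. The sketch only shows the general pattern: generate locally, check the structured fields with rules, and escalate to a stronger (paid) model only when the check fails or the sample is flagged as difficult.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ReasoningRecord:
    """Hypothetical Perception-Evaluation-Revision record (field names are assumptions)."""
    perception: List[str]               # key objects grounded in future motion
    evaluation: Dict[str, str]          # candidate ego trajectory -> rule-checkable evidence
    revision: str                       # revised intent toward expert behavior
    plan: List[Tuple[float, float]]     # final planning target as (x, y) waypoints


def validate(record: ReasoningRecord) -> List[str]:
    """Return a list of rule violations; an empty list means the record is valid.

    These four checks are placeholders, not the paper's actual validator rules.
    """
    errors = []
    if not record.perception:
        errors.append("perception: no key objects")
    if not record.evaluation:
        errors.append("evaluation: no candidate trajectories")
    if not record.revision.strip():
        errors.append("revision: empty")
    if not record.plan:
        errors.append("plan: no waypoints")
    return errors


def build_sample(
    scene,
    local_generate: Callable,
    escalate: Callable,
    is_difficult: Callable = lambda scene: False,
) -> Tuple[ReasoningRecord, bool]:
    """Generate locally first; escalate only invalid or difficult samples.

    Returns the record and a flag indicating whether escalation was used.
    """
    record = local_generate(scene)
    errors = validate(record)
    if not errors and not is_difficult(scene):
        return record, False            # cheap path: local output accepted as-is
    # Expensive path: pass the validator's findings to the stronger model.
    return escalate(scene, errors), True
```

Under these assumptions, the escalation flag returned by `build_sample` can be counted over a dataset to estimate how many samples required the expensive path, which is the quantity that drives paid model cost in this kind of pipeline.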