VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of training vision-language driving models under limited annotation budgets, where existing approaches rely on costly, free-form reasoning supervision. The authors propose VeriDrive, a framework that constructs planning-oriented, verifiable counterfactual supervision. It structures driving reasoning as a Perception-Evaluation-Revision chain and scales data construction by combining local generation with validator-guided selective correction, escalating only invalid or difficult samples. The resulting supervision has auditable intermediate fields and structured revision targets, which keeps annotation cost low. Trained under the Omni-Q protocol on a dataset built from nuScenes, VeriDrive improves on OmniDrive in L2, Collision, and Intersection in controlled open-loop experiments while reducing logged token usage, generation time, and paid LLM/VLM cost.

📝 Abstract

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts will be released after the review process.
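The validator-guided selective correction described in the abstract can be illustrated with a minimal sketch. Because the authors' code and validator scripts are not yet released, everything here is an assumption made for illustration: the `Sample` fields, the rules in `validate`, and the `local_generate` and `escalate` callables are hypothetical names, not the paper's actual schema or API. The sketch only covers the "invalid" branch; how VeriDrive identifies "difficult" samples is not specified in the abstract and is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    """Hypothetical Perception-Evaluation-Revision record (field names assumed)."""
    scene_id: str
    perception: List[str]    # key objects grounded in the scene
    evaluation: List[Dict]   # one dict per candidate ego trajectory
    revision: str            # revised intent toward expert behavior


def validate(sample: Sample) -> List[str]:
    """Return rule violations; an empty list means the sample passes.

    These rules are illustrative stand-ins, not the paper's validator.
    """
    errors = []
    if not sample.perception:
        errors.append("no key objects listed")
    if not sample.evaluation:
        errors.append("no candidate trajectories evaluated")
    if any(not isinstance(e.get("collides"), bool) for e in sample.evaluation):
        errors.append("evaluation lacks a checkable collision flag")
    if not sample.revision.strip():
        errors.append("missing revision target")
    return errors


def build_dataset(
    scene_ids: List[str],
    local_generate: Callable[[str], Sample],
    escalate: Callable[[str, List[str]], Sample],
) -> Tuple[List[Sample], int]:
    """Generate locally, escalate only samples the validator rejects."""
    accepted: List[Sample] = []
    escalations = 0
    for scene_id in scene_ids:
        sample = local_generate(scene_id)           # cheap local model
        errors = validate(sample)
        if errors:
            sample = escalate(scene_id, errors)     # costlier model, invalid cases only
            escalations += 1
            if validate(sample):
                continue                            # still invalid: drop the sample
        accepted.append(sample)
    return accepted, escalations
```

The cost saving in this sketch comes from calling `escalate` only when `validate` reports a violation, so the number of expensive calls scales with the local model's failure rate rather than with the dataset size.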

Problem

Research questions and friction points this paper is trying to address.

vision-language planning

counterfactual supervision

driving rationale

cost-efficient annotation

verifiable reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

verifiable supervision

counterfactual reasoning

vision-language planning