DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing autonomous driving reward models, which rely on handcrafted rules or ground-truth perception and struggle to capture failure cases and diverse driving behaviors. To overcome these limitations, the authors introduce the DriveReward dataset, which features temporally grounded, visually guided annotations and counterfactual driving behaviors, along with a specialized vision-language reward model. Benchmark evaluations on the dataset show that even leading open-source and proprietary vision-language models (VLMs) fail to excel across all tasks, whereas the proposed model, with only 1 billion parameters, surpasses larger general-purpose VLMs in task-specific reward alignment. When integrated into reinforcement learning fine-tuning and multimodal trajectory scoring, it achieves performance comparable to rule-based rewards in both open-loop and closed-loop evaluations.
📝 Abstract
Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance and augmented with counterfactual driving behaviors, and (2) developing a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.
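As a rough illustration of the multi-modal trajectory scoring use case mentioned in the abstract, the Python sketch below picks the highest-scoring candidate from a set of proposed trajectories. It is not the paper's implementation: `select_trajectory`, `toy_reward`, the waypoint format, and the penalty weight are all hypothetical, and `toy_reward` is a hand-written stand-in for where a learned reward model's score would be plugged in.

```python
from typing import Callable, Sequence, Tuple

Waypoint = Tuple[float, float]      # assumed format: (x, y) in metres, ego frame
Trajectory = Sequence[Waypoint]


def select_trajectory(
    candidates: Sequence[Trajectory],
    reward_fn: Callable[[Trajectory], float],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate with the highest reward."""
    if not candidates:
        raise ValueError("at least one candidate trajectory is required")
    scores = [reward_fn(traj) for traj in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def toy_reward(traj: Trajectory) -> float:
    """Hypothetical placeholder reward, NOT the DriveReward model.

    Rewards forward progress along x and penalises the largest lateral
    offset; a learned reward model would replace this function.
    """
    progress = traj[-1][0] - traj[0][0]
    max_lateral = max(abs(y) for _, y in traj)
    return progress - 2.0 * max_lateral


candidates = [
    [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],  # straight, full progress
    [(0.0, 0.0), (5.0, 1.0), (10.0, 3.0)],  # drifts laterally
    [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)],   # straight but slow
]
best_index, best_score = select_trajectory(candidates, toy_reward)
```

Keeping the selection logic separate from `reward_fn` means the same loop could, in principle, be driven by either a rule-based reward or a learned one, which is the comparison the abstract describes.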
Problem

Research questions and friction points this paper is trying to address.

reward model
autonomous driving
vision-language model
trajectory evaluation
counterfactual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Reward Model
Counterfactual Data Annotation
Autonomous Driving
Reward Alignment
DriveReward Dataset