PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited behavioral diversity of non-ego traffic agents in closed-loop driving simulators, which are typically driven by rule-based traffic managers or by learned models trained toward a single behavioral mode. The authors introduce PersonaDrive, which builds on a human driving dataset collected under explicit style instructions (aggressive, neutral, and conservative) on CARLA leaderboard routes with a driver-in-the-loop rig, and propose a retrieval-augmented vision-language-action (VLA) pipeline. The pipeline has three stages: offline triplet mining with a combined image-text similarity score, a lightweight retrieval head that fuses frozen visual features with a small control encoder, and fine-tuning of a single VLA backbone to treat retrieved context points as in-context demonstrations during waypoint prediction. At inference, a style is selected by swapping the per-style database that the retrieval head queries, so no per-style retraining is needed. On Bench2Drive, the model without style conditioning improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD; with style conditioning it attains the highest driving score in every style, and average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
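The style switching summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity lookup, the two-dimensional embeddings, the toy waypoints, and the names `STYLE_DBS`, `retrieve`, and `context_for` are all assumptions made for illustration. The only point it shows is that choosing a style changes which database is searched, while the retrieval code stays the same.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# A demonstration pairs a scene embedding with the waypoints a human drove.
Demo = Tuple[Vector, List[Tuple[float, float]]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: Vector, database: List[Demo], k: int = 1) -> List[Demo]:
    """Return the k demonstrations whose embeddings are closest to the query."""
    ranked = sorted(database, key=lambda d: cosine(query, d[0]), reverse=True)
    return ranked[:k]


# Hypothetical toy databases, one per style instruction (values are made up).
STYLE_DBS: Dict[str, List[Demo]] = {
    "aggressive": [
        ([1.0, 0.0], [(0.0, 4.0), (0.0, 9.0)]),
        ([0.0, 1.0], [(1.0, 3.0), (2.5, 6.0)]),
    ],
    "conservative": [
        ([1.0, 0.0], [(0.0, 2.0), (0.0, 4.0)]),
        ([0.0, 1.0], [(0.5, 1.5), (1.0, 3.0)]),
    ],
}


def context_for(style: str, query: Vector, k: int = 1) -> List[Demo]:
    """Selecting a style only changes which database is searched."""
    return retrieve(query, STYLE_DBS[style], k)


# The same query embedding yields different demonstrations per style.
query = [0.9, 0.1]
aggressive_ctx = context_for("aggressive", query)
conservative_ctx = context_for("conservative", query)
```

In this toy setup the same query returns a longer-stride waypoint pair from the aggressive database than from the conservative one, which is the kind of context a shared backbone would then be conditioned on.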

📝 Abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
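Stage (i) of the pipeline, offline triplet mining with a combined image-text similarity score, can be sketched as follows. The abstract does not specify how the two similarities are combined or how positives and negatives are chosen, so this sketch assumes a weighted sum with a hypothetical weight `alpha`, takes the highest-scoring sample as the positive, and takes the lowest-scoring sample as the negative. The function names and toy embeddings are likewise illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Each sample is (image embedding, text embedding) from one style's data.
Sample = Tuple[Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def combined_score(a: Sample, b: Sample, alpha: float = 0.5) -> float:
    """Assumed combination: weighted sum of image and text similarity."""
    return alpha * cosine(a[0], b[0]) + (1.0 - alpha) * cosine(a[1], b[1])


def mine_triplets(samples: List[Sample], alpha: float = 0.5) -> List[Tuple[int, int, int]]:
    """For each anchor, pick the most similar sample as the positive and the
    least similar sample as the negative (an assumed selection rule)."""
    triplets = []
    for i, anchor in enumerate(samples):
        scored = [
            (combined_score(anchor, other, alpha), j)
            for j, other in enumerate(samples)
            if j != i
        ]
        if len(scored) < 2:
            continue  # need at least two other samples to form a triplet
        positive = max(scored)[1]
        negative = min(scored)[1]
        triplets.append((i, positive, negative))
    return triplets


# Toy per-style data: samples 0 and 1 are similar scenes, sample 2 differs.
samples = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.9, 0.1], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
]
triplets = mine_triplets(samples)  # list of (anchor, positive, negative) indices
```

Triplets of this form could then supervise a retrieval head with a standard triplet loss; the loss actually used in the paper is not stated in the abstract.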

Problem

Research questions and friction points this paper is trying to address.

closed-loop driving simulation

driving style diversity

human-style driving agents

non-ego traffic agents

behavioral variation

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented VLA

human-style driving

style-conditioned simulation