VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing vision-language-action (VLA) models in dense 3D driving scenes: without explicit geometric modeling, their behavioral decisions are poorly grounded in the surrounding 3D world. To overcome this, the authors propose VLGA, a framework that adds geometry as a fourth modality alongside vision, language, and action. A dedicated geometry expert is supervised with a per-pixel pointmap regression loss against LiDAR, so dense 3D scene reconstruction and driving action prediction are optimized jointly. On the nuScenes open-loop benchmark, VLGA sets a new state of the art among VLA methods that do not use ego status, with an average trajectory L2 error of 0.50 m and a 3-second collision rate of 0.18%. On the Bench2Drive closed-loop evaluation, it attains a driving score of 79.08, which is 0.71 above the strongest prior VLA, at comparable efficiency and comfort.
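For readers unfamiliar with the open-loop metric, the following is a minimal sketch of an average trajectory L2 error, assuming it is the mean Euclidean distance between predicted and ground-truth waypoints. The function name `average_l2_error` and the `(T, 2)` waypoint layout are illustrative assumptions; the exact horizons and averaging protocol used in the paper are not stated in this summary.

```python
import numpy as np

def average_l2_error(pred, gt):
    """Mean Euclidean distance (in metres) between matching waypoints.

    pred, gt: array-likes of shape (T, 2) holding planned and ground-truth
    (x, y) positions at the same T future timesteps (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-waypoint displacement, then the average over the horizon.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

For example, a plan that matches the first waypoint exactly and misses the second by 5 m yields an average error of 2.5 m.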

📝 Abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
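To make the idea of a per-pixel pointmap regression loss concrete, here is a minimal sketch under stated assumptions. The abstract does not give the loss form, so the L1 distance, the validity mask for pixels without a LiDAR return, and the name `pointmap_regression_loss` are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def pointmap_regression_loss(pred_pointmap, lidar_pointmap, valid_mask):
    """Masked per-pixel L1 loss between predicted and LiDAR-derived 3D points.

    pred_pointmap, lidar_pointmap: arrays of shape (H, W, 3), one (x, y, z)
        point per pixel (assumed layout).
    valid_mask: boolean array of shape (H, W), True where a LiDAR return
        projects onto the pixel (assumption: LiDAR coverage is sparse).
    """
    pred = np.asarray(pred_pointmap, dtype=float)
    target = np.asarray(lidar_pointmap, dtype=float)
    mask = np.asarray(valid_mask, dtype=bool)
    if not mask.any():
        # No supervised pixels in this image: contribute nothing.
        return 0.0
    # Sum absolute errors over x, y, z, then average over supervised pixels.
    per_pixel = np.abs(pred - target).sum(axis=-1)
    return float(per_pixel[mask].mean())
```

In a training setup of this kind, such a term would typically be added to the action-prediction loss with a weighting factor, so that the geometry expert and the policy are optimized together; the weighting used by VLGA is not stated here.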

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

autonomous driving

3D geometry grounding

dense spatial signal

action grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

dense 3D reconstruction

vision-language-action models

geometry grounding