SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to integrate the high-level reasoning of vision-language models with the fine-grained spatial representations of semantic occupancy, limiting unified 4D scene understanding and planning in autonomous driving. To address this, the paper proposes sparse occupancy queries as a unified bridge between the visual and linguistic modalities, produced by a lightweight sparse occupancy encoder that enables cross-modal alignment. An LLM-guided Anchor-Diffusion planner is further introduced, which decouples anchor scoring from denoising to support cross-modal conditional trajectory generation. The method achieves state-of-the-art open-loop planning performance on nuScenes, a 7% relative improvement in CIDEr on OmniDrive-nuScenes, and a 0.5-point gain in mIoU on Occ3D-nuScenes.

📝 Abstract
In autonomous driving, Vision-Language Models (VLMs) excel at high-level reasoning, whereas semantic occupancy provides fine-grained spatial detail. Despite significant progress in each field individually, no existing method effectively integrates both paradigms. Conventional VLMs struggle with token explosion and limited spatiotemporal reasoning, while semantic occupancy offers a unified, explicit spatial representation but is too dense to integrate efficiently with VLMs. To address these challenges and bridge the gap between VLMs and occupancy, we propose SparseOccVLA, a novel vision-language-action model that unifies scene understanding, occupancy forecasting, and trajectory planning, powered by sparse occupancy queries. Starting from a lightweight Sparse Occupancy Encoder, SparseOccVLA generates compact yet highly informative sparse occupancy queries that serve as the single bridge between vision and language. These queries are aligned into the language space and reasoned over by the LLM for unified scene understanding and future occupancy forecasting. Furthermore, we introduce an LLM-guided Anchor-Diffusion Planner featuring decoupled anchor scoring and denoising, as well as cross-modal trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr over the state of the art on OmniDrive-nuScenes, a 0.5-point increase in mIoU on Occ3D-nuScenes, and state-of-the-art open-loop planning metrics on the nuScenes benchmark, demonstrating its strong holistic capability.
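The abstract describes a four-stage data flow: encode images into sparse occupancy queries, align the queries into the language space, reason over them with an LLM, and condition an anchor-diffusion planner on the result. The toy sketch below illustrates only that data flow with stub modules and made-up dimensions (query count, embedding sizes, anchor count, and all function names are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_occupancy_encoder(images, num_queries=256, dim=512):
    """Stub: compress multi-view images into a small set of occupancy queries.
    A real encoder would lift image features into 3D and keep only informative
    occupied regions; here we just emit placeholder query embeddings."""
    return rng.standard_normal((num_queries, dim))

def align_to_language_space(queries, lang_dim=1024):
    """Stub: project occupancy queries into the LLM's token-embedding space."""
    w = rng.standard_normal((queries.shape[1], lang_dim)) / np.sqrt(queries.shape[1])
    return queries @ w

def llm_reason(tokens):
    """Stub: the LLM consumes query tokens and emits a condition vector
    (standing in for its scene-understanding / forecasting output)."""
    return tokens.mean(axis=0)

def anchor_diffusion_planner(condition, num_anchors=20, horizon=6):
    """Stub: score a set of trajectory anchors against the LLM condition,
    then pick the best one (a real planner would refine it by denoising)."""
    anchors = rng.standard_normal((num_anchors, horizon, 2))  # (x, y) waypoints
    proj = rng.standard_normal((horizon * 2, condition.shape[0]))
    scores = (anchors.reshape(num_anchors, -1) @ proj) @ condition
    return anchors[int(np.argmax(scores))]

images = rng.standard_normal((6, 3, 224, 224))    # 6 surround-view cameras
queries = sparse_occupancy_encoder(images)        # (256, 512)
tokens = align_to_language_space(queries)         # (256, 1024)
condition = llm_reason(tokens)                    # (1024,)
trajectory = anchor_diffusion_planner(condition)  # (6, 2) future waypoints
```

The point of the sketch is the bottleneck: a few hundred query vectors, rather than dense voxel grids or raw image tokens, are the only thing the language model ever sees.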
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Semantic Occupancy
4D Scene Understanding
Autonomous Driving
Spatiotemporal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Occupancy Queries
Vision-Language Models
Semantic Occupancy
LLM-guided Planning
4D Scene Understanding
👥 Authors

Chenxu Dang
Huazhong University of Science and Technology
Computer Vision, Autonomous Driving

Jie Wang
Xiaomi EV

Guang Li
Assistant Professor, Hokkaido University
Dataset Distillation, Self-Supervised Learning, Data-Centric AI, Medical Image Analysis

Zhiwen Hou
Xiaomi EV

Zihan You
Institute for AI Industry Research (AIR), Tsinghua University

Hangjun Ye
Xiaomi EV

Jie Ma
Huazhong University of Science and Technology

Long Chen
Xiaomi EV

Yan Wang
Tsinghua University; SenseTime
Neural Compression, Computer Vision, Machine Learning