Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

📅 2025-12-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Natural-language instructions for autonomous driving are often ambiguous, strongly context-dependent, and hard to ground without modeling 3D spatial relations and dynamic scene evolution. To address this, the paper proposes ThinkDeeper, a framework that introduces world models into vision-language grounding. It constructs a Spatial-Aware World Model (SA-WM) and integrates a hypergraph-guided decoder so that localization decisions are informed by reasoning over predicted future states. Methodologically, it unifies instruction-aware latent state modeling, forward sequence prediction, and hypergraph-guided hierarchical multimodal fusion, and it leverages a RAG-augmented LLM pipeline to generate high-quality semantic annotations for the accompanying DrivePilot dataset. Evaluated on multiple benchmarks, ThinkDeeper achieves state-of-the-art performance: first place on the Talk2Car leaderboard and consistent gains over prior art on DrivePilot, MoCAD, and the RefCOCO variants. Notably, it retains superior accuracy even when trained on only 50% of the data, indicating strong robustness and generalization.
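
The forward-reasoning step described above (distil the scene and command into a latent state, then roll it forward before grounding) can be pictured with a small PyTorch-style sketch. Everything below is an illustrative assumption: the class name SAWMSketch, the GRUCell dynamics, the pooled-feature inputs, and the rollout horizon are stand-ins and do not reproduce the paper's SA-WM.

```python
import torch
import torch.nn as nn


class SAWMSketch(nn.Module):
    """Command-aware latent state with a short forward rollout (illustrative only)."""

    def __init__(self, vis_dim=256, txt_dim=256, latent_dim=256, horizon=4):
        super().__init__()
        self.horizon = horizon
        # Distil pooled scene features and the command embedding into one latent state.
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Command-conditioned forward dynamics: predict the next latent from the current one.
        self.cmd_proj = nn.Linear(txt_dim, latent_dim)
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)

    def forward(self, vis_feat, cmd_feat):
        # vis_feat: (B, vis_dim) pooled scene features; cmd_feat: (B, txt_dim) command embedding.
        state = self.fuse(torch.cat([vis_feat, cmd_feat], dim=-1))
        cmd = self.cmd_proj(cmd_feat)
        rollout = [state]
        for _ in range(self.horizon):
            # Roll the latent state forward, conditioned on the command at every step.
            state = self.dynamics(cmd, state)
            rollout.append(state)
        # (B, horizon + 1, latent_dim): the current state plus the predicted future states.
        return torch.stack(rollout, dim=1)
```

Conditioning each recurrent step on the command keeps the rollout instruction-aware, which is the property the summary attributes to the SA-WM; in a setup like this, the stacked latents would then be handed to a decoder together with the original multimodal features.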

📝 Abstract
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset for AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g. Notably, it shows strong robustness and efficiency in challenging scenes (long text, multiple agents, ambiguity) and retains superior performance even when trained on 50% of the data.
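
As one way to read the "hypergraph-guided decoder" sentence, the toy layer below runs a single hypergraph-convolution step over a joint set of tokens (visual regions, command words, predicted future states): node features are pooled into hyperedges and then scattered back to every node, so each token is updated by higher-order groups rather than only pairwise neighbours. The incidence-matrix construction, the normalisation, and the residual update are assumptions for illustration, not the paper's decoder.

```python
import torch
import torch.nn as nn


class HypergraphFusionSketch(nn.Module):
    """One hypergraph-convolution step over multimodal tokens (illustrative only)."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens, incidence):
        # tokens:    (B, N, dim) node features (visual regions, command words, future states).
        # incidence: (B, N, E) 0/1 float matrix; incidence[b, i, e] = 1 if node i is in hyperedge e.
        deg_v = incidence.sum(dim=-1, keepdim=True).clamp(min=1)  # node degrees, (B, N, 1)
        deg_e = incidence.sum(dim=-2, keepdim=True).clamp(min=1)  # hyperedge degrees, (B, 1, E)
        # Aggregate node features into each hyperedge, then scatter them back to the nodes.
        edge_msg = torch.bmm(incidence.transpose(1, 2), tokens) / deg_e.transpose(1, 2)  # (B, E, dim)
        node_msg = torch.bmm(incidence, edge_msg) / deg_v                                # (B, N, dim)
        return tokens + torch.relu(self.proj(node_msg))  # residual update of every token
```
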
Problem

Research questions and friction points this paper is trying to address.

Interpreting ambiguous natural-language commands for autonomous vehicle object localization.
Addressing lack of 3D spatial reasoning and scene evolution anticipation in grounding.
Enhancing robustness in challenging scenes like long-text and multi-agent scenarios.
Innovation

Methods, ideas, or system contributions that make the work stand out.

World model predicts future spatial states for disambiguation
Hypergraph decoder fuses multimodal inputs hierarchically
RAG and CoT LLM pipeline generates dataset annotations (see the sketch after this list)
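
The annotation bullet above amounts to a retrieve-then-prompt loop. The sketch below shows one plausible shape for it; the function names (build_annotation_prompt, retriever, llm_generate), the prompt wording, and the output fields are hypothetical stand-ins, not the authors' DrivePilot pipeline.

```python
def build_annotation_prompt(command: str, scene_meta: dict, retrieved: list[str]) -> str:
    """Assemble a chain-of-thought prompt grounded in retrieved example annotations."""
    context = "\n".join(f"- {example}" for example in retrieved)
    return (
        "You annotate driving commands with semantic tags.\n"
        f"Similar annotated scenes:\n{context}\n"
        f"Scene metadata: {scene_meta}\n"
        f'Command: "{command}"\n'
        "Think step by step about the referred object, its spatial relation to the ego "
        "vehicle, and the intended maneuver, then output a JSON annotation with the "
        "fields: target, relation, action."
    )


def annotate(command, scene_meta, retriever, llm_generate, k=3):
    """retriever and llm_generate are stand-ins for a vector store and an LLM API."""
    retrieved = retriever(command, k=k)  # RAG step: fetch similar annotated examples
    prompt = build_annotation_prompt(command, scene_meta, retrieved)
    return llm_generate(prompt)          # CoT-prompted generation
```
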
🔎 Similar Papers
No similar papers found.
Haicheng Liao
University of Macau
Huanming Shen
UESTC
Bonan Wang
Unknown affiliation
Yongkang Li
Purdue University
Yihong Tang
McGill University
Chengyue Wang
University of Macau
Dingyi Zhuang
Massachusetts Institute of Technology
Kehua Chen
Postdoc @ University of Washington | Ph.D. @ HKUST
Hai Yang
The Hong Kong University of Science and Technology
Chengzhong Xu
University of Macau
Zhenning Li
University of Macau