GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

📅 2025-12-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Driving World Models (DWMs) suffer from limited 3D scene understanding and language-driven reasoning, while point cloud and BEV representations inherently struggle with text-3D alignment. To address these limitations, we propose a 3D Gaussian-based Driving World Model that pioneers early-stage cross-modal alignment by embedding language features into each Gaussian primitive. We introduce a task-aware, language-guided sparse sampling strategy and design a dual-conditioned (language + image) multimodal diffusion architecture for generative modeling. By unifying BEV and point cloud representations, our model enables holistic 3D environmental understanding, language-conditioned scene reasoning, and coherent multimodal generation. Evaluated on nuScenes and NuInteract, it achieves state-of-the-art performance, significantly improving both 3D perception accuracy and generation consistency. The code will be released publicly.

📝 Abstract
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also providing contextual enrichment for both tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware, language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate, compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, in which the information captured by our vision-language model serves as a high-level language condition that, combined with a low-level image condition, jointly guides the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.
Problem

Research questions and friction points this paper is trying to address.

Lack of 3D scene understanding in driving world models
Misalignment between textual information and 3D spatial data
Inability to jointly perform understanding and generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian scene representation for unified understanding and generation
Embed linguistic features into Gaussian primitives for modality alignment
Dual-condition model with language and image guidance for generation
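To make the first two contributions concrete, here is a minimal, hypothetical sketch (not the authors' implementation; all names, shapes, and the feature dimension are assumptions): each 3D Gaussian primitive carries a language-aligned feature vector alongside its geometric parameters, and a language-guided scorer keeps only the top-k most query-relevant primitives as compact 3D tokens for a downstream LLM.

```python
import numpy as np

def make_gaussians(n, feat_dim, seed=0):
    """Each Gaussian: 3D mean, per-axis scale, opacity, plus an embedded
    language-aligned feature vector (the early modality-alignment idea)."""
    rng = np.random.default_rng(seed)
    return {
        "means": rng.normal(size=(n, 3)),              # 3D centers
        "scales": rng.uniform(0.1, 1.0, size=(n, 3)),  # anisotropic extents
        "opacity": rng.uniform(size=n),
        "lang_feat": rng.normal(size=(n, feat_dim)),   # per-primitive language feature
    }

def language_guided_sample(gaussians, query_feat, k):
    """Task-aware sparse sampling sketch: score each primitive by cosine
    similarity between its language feature and a text-query embedding,
    then keep the top-k as compact 3D tokens."""
    f = gaussians["lang_feat"]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q                      # (n,) relevance scores
    keep = np.argsort(-scores)[:k]      # indices of the k best-aligned Gaussians
    return keep, gaussians["lang_feat"][keep]

g = make_gaussians(n=1000, feat_dim=32)
idx, tokens = language_guided_sample(g, np.ones(32), k=64)
print(tokens.shape)  # (64, 32)
```

The dual-condition generation stage would then consume these language tokens as the high-level condition alongside low-level image features; that fusion is omitted here since the paper page gives no architectural detail.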
Tianchen Deng
Shanghai Jiao Tong University
Robotics, Computer Vision
Xuefeng Chen
Tsinghua University
Yi Chen
Shanghai Jiao Tong University
Qu Chen
MEGVII Technology, Mach Drive
Yuyao Xu
MEGVII Technology, Mach Drive
Lijin Yang
MEGVII Technology, Mach Drive
Le Xu
MEGVII Technology, Mach Drive
Yu Zhang
MEGVII Technology, Mach Drive
Bo Zhang
MEGVII Technology, Mach Drive
Wuxiong Huang
MEGVII Technology, Mach Drive
Hesheng Wang
Shanghai Jiao Tong University