RenderWorld: World Model with Self-Supervised 3D Label

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
To address low accuracy in 4D occupancy forecasting and motion planning, along with excessive GPU memory consumption, in vision-only end-to-end autonomous driving, this paper proposes the first fully self-supervised framework for this setting. It introduces: (1) Img2Occ, a self-supervised module that uses Gaussian splatting to generate high-fidelity 3D occupancy labels; (2) an Adaptive Masked Variational Autoencoder (AM-VAE) that models occupied and unoccupied spatial features separately to enhance representational capacity; (3) efficient 3D reconstruction via Gaussian splatting, in place of NeRF, to drastically reduce GPU memory usage; and (4) an autoregressive world model that enables joint optimization of 4D spatiotemporal scene modeling and motion planning. Evaluated on nuScenes and other benchmarks, the method achieves state-of-the-art performance in both 4D occupancy prediction and trajectory planning, with substantial gains in segmentation mIoU and over 60% reduction in GPU memory consumption compared to NeRF-based baselines.

📝 Abstract
Vision-only end-to-end autonomous driving is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with an autoregressive world model.
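
The abstract's idea of encoding air and non-air separately can be illustrated with a small masking sketch. This is not the paper's AM-VAE: it only shows how a semantic occupancy grid (flattened here to a list of voxel labels) might be split into an air branch and a non-air branch before each is encoded, and then merged back. The names `split_air_nonair` and `merge_air_nonair`, the label id `AIR = 0`, and the placeholder `IGNORE = -1` are all assumptions made for illustration.

```python
from typing import List, Tuple

AIR = 0      # assumed label id for empty space (hypothetical)
IGNORE = -1  # assumed placeholder for voxels hidden from a branch


def split_air_nonair(occ: List[int]) -> Tuple[List[int], List[int], List[bool]]:
    """Split flattened voxel labels into an air branch and a non-air branch."""
    # True where the voxel is occupied by something other than air.
    mask = [label != AIR for label in occ]
    # Non-air branch: keep semantic labels, hide the air voxels.
    nonair = [label if m else IGNORE for label, m in zip(occ, mask)]
    # Air branch: keep air voxels, hide the occupied ones.
    air = [IGNORE if m else label for label, m in zip(occ, mask)]
    return air, nonair, mask


def merge_air_nonair(air: List[int], nonair: List[int], mask: List[bool]) -> List[int]:
    """Recombine the two branches into one label list using the mask."""
    return [n if m else a for a, n, m in zip(air, nonair, mask)]


if __name__ == "__main__":
    occ = [0, 3, 0, 0, 5, 0, 2, 2]  # toy flattened grid, labels are made up
    air, nonair, mask = split_air_nonair(occ)
    assert merge_air_nonair(air, nonair, mask) == occ
```

In a real pipeline each branch would be passed to its own encoder; the split and merge here only make the separation of the two voxel populations concrete.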
Problem

Research questions and friction points this paper is trying to address.

Cost-effective vision-only autonomous driving
Self-supervised 3D occupancy label generation
Improved segmentation and GPU efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised 3D labeling
Gaussian Splatting scene representation
AM-VAE encoding for fine detail
Authors
Ziyang Yan
University of Central Florida | University of Trento | FBK
3D Reconstruction, Computer Vision, AIGC
Wenzhen Dong
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yihua Shao
The University of Science and Technology Beijing, Beijing, China
Yuhang Lu
ShanghaiTech University, Shanghai, China
Haiyang Liu
The University of Tokyo
Human Video Generation, Motion Generation, Multi-Modal Understanding and Generation
Jingwen Liu
The University of Science and Technology Beijing, Beijing, China
Haozhe Wang
The Hong Kong University of Science and Technology, Hong Kong, China
Zhe Wang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yan Wang
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Fabio Remondino
3D Optical Metrology - Bruno Kessler Foundation
photogrammetry, 3D modeling, AI
Yuexin Ma
Assistant Professor, School of Information Science and Technology, ShanghaiTech University
computer vision, embodied AI, autonomous driving