ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder

📅 2024-12-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses real-time 3D scene understanding for autonomous driving, proposing ProtoOcc to jointly predict 3D voxel occupancy and semantics from single-frame inputs while balancing accuracy and efficiency. Methodologically: (1) a dual-branch encoder fuses bird's-eye-view (BEV) and multi-scale voxel representations; (2) a scene-adaptive/scene-agnostic prototype query mechanism replaces computationally expensive iterative Transformer decoding; (3) robust prototype learning, exponential moving average (EMA) weight updating, and noise-injection–denoising training enhance generalization. On the Occ3D-nuScenes benchmark, ProtoOcc achieves a state-of-the-art 45.02% mIoU. As a single-frame method, it attains 39.56% mIoU at 12.83 FPS on an RTX 3090, significantly outperforming existing real-time methods in both accuracy and speed.

📝 Abstract
In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual branch structure. This design enhances both performance and computational efficiency by providing a large receptive field for the BEV representation while maintaining a smaller receptive field for the voxel representation. The PQD introduces Prototype Queries to accelerate the decoding process. Scene-Adaptive Prototypes are derived from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are computed by applying an Exponential Moving Average to the Scene-Adaptive Prototypes during the training phase. By using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which injects noise into the prototype generation process and trains the model to denoise it during the training phase. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. As a single-frame method, it reaches 39.56% mIoU with an inference speed of 12.83 FPS on an NVIDIA RTX 3090. Our code can be found at https://github.com/SPA-junghokim/ProtoOcc.
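The prototype-query idea described in the abstract can be sketched in a few lines of NumPy. This is a conceptual illustration, not the paper's implementation: the masked-average-pooling prototype construction, the EMA momentum value, and the dot-product decoding rule are assumptions about how such a scheme typically works, and all function names here are hypothetical.

```python
import numpy as np

def scene_adaptive_prototypes(voxel_feats, voxel_labels, num_classes):
    """Masked average pooling: one prototype per semantic class.

    voxel_feats:  (N, C) flattened voxel features
    voxel_labels: (N,)   per-voxel class ids (hypothetical supervision signal)
    Returns (num_classes, C); classes absent from the scene get a zero vector.
    """
    protos = np.zeros((num_classes, voxel_feats.shape[1]), dtype=voxel_feats.dtype)
    for k in range(num_classes):
        mask = voxel_labels == k
        if mask.any():
            protos[k] = voxel_feats[mask].mean(axis=0)
    return protos

def ema_update(scene_agnostic, scene_adaptive, momentum=0.99):
    """Scene-Agnostic Prototypes maintained as an EMA over the
    per-sample Scene-Adaptive Prototypes during training."""
    return momentum * scene_agnostic + (1.0 - momentum) * scene_adaptive

def decode_single_step(voxel_feats, prototypes):
    """One-step decoding: score every voxel against every prototype query
    (dot-product similarity) and take the argmax, in place of iterative
    Transformer decoding."""
    logits = voxel_feats @ prototypes.T   # (N, num_classes)
    return logits.argmax(axis=1)          # (N,) predicted class ids
```

Because decoding reduces to one matrix multiply plus an argmax, inference cost no longer grows with the number of decoder iterations, which is the efficiency argument behind the PQD.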
Problem

Research questions and friction points this paper is trying to address.

Predict 3D voxel occupancy states
Enhance computational efficiency
Accelerate decoding with Prototype Queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Dual Branch Encoder for 3D representation
Employs Prototype Query Decoder for efficiency
Introduces Robust Prototype Learning with noise injection
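The third bullet, Robust Prototype Learning, perturbs the prototypes during training and supervises the model to remain accurate despite the perturbation. A minimal sketch of one training-loss step, assuming Gaussian noise and a cross-entropy objective (the paper's exact noise model and loss are not specified here, and the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_prototype_step(prototypes, voxel_feats, voxel_labels, noise_std=0.1):
    """Inject Gaussian noise into the prototypes, then score how well the
    noisy prototypes still classify the voxels. Minimizing this loss pushes
    the encoder toward prototypes that decode robustly under perturbation.

    prototypes:   (K, C) prototype queries
    voxel_feats:  (N, C) voxel features
    voxel_labels: (N,)   ground-truth class ids
    Returns a scalar cross-entropy loss.
    """
    noisy = prototypes + rng.normal(0.0, noise_std, prototypes.shape)
    logits = voxel_feats @ noisy.T                       # (N, K)
    # numerically stable log-softmax, then mean NLL of the true classes
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(voxel_labels)), voxel_labels].mean()
```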
Jungho Kim
Seoul National University
Changwon Kang
Hanyang University
Dongyoung Lee
Hanyang University
Sehwan Choi
Hanyang University
Jungwook Choi
Hanyang University
Deep Neural Network · Quantization · Large Language Model · Efficient AI · AI Accelerator