DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses large-scale outdoor 3D scene understanding from monocular images, jointly predicting voxel-wise occupancy and semantic labels. To tackle challenges including severe geometric ambiguity and high computational cost, we propose a depth-aware global query (GQ) mechanism and a hierarchical supervision strategy (HSS). GQ enables efficient cross-modal mapping from 2D image features to 3D voxels via learnable global queries, eliminating costly high-resolution voxel upsampling. HSS integrates prior depth encoding, multi-scale attention, and hierarchical losses to enhance geometric–semantic consistency. Our method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360, reducing GPU memory consumption by 32% and accelerating inference by 2.1× over prior approaches. To the best of our knowledge, it is the first monocular method to achieve both high accuracy and efficiency in large-scale 3D scene reconstruction and semantic parsing.

Technology Category

Application Category

📝 Abstract
Monocular 3D occupancy prediction, aiming to predict the occupancy and semantics within interesting regions of 3D scenes from only 2D images, has garnered increasing attention recently for its vital role in 3D scene understanding. Predicting the 3D occupancy of large-scale outdoor scenes from 2D images is ill-posed and resource-intensive. In this paper, we present extbf{DGOcc}, a extbf{D}epth-aware extbf{G}lobal query-based network for monocular 3D extbf{Occ}upancy prediction. We first explore prior depth maps to extract depth context features that provide explicit geometric information for the occupancy network. Then, in order to fully exploit the depth context features, we propose a Global Query-based (GQ) Module. The cooperation of attention mechanisms and scale-aware operations facilitates the feature interaction between images and 3D voxels. Moreover, a Hierarchical Supervision Strategy (HSS) is designed to avoid upsampling the high-dimension 3D voxel features to full resolution, which mitigates GPU memory utilization and time cost. Extensive experiments on SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate that the proposed method achieves the best performance on monocular semantic occupancy prediction while reducing GPU and time overhead.
Problem

Research questions and friction points this paper is trying to address.

Predict 3D occupancy from 2D images efficiently
Leverage depth context for geometric information
Reduce GPU memory and time costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-aware features from prior depth maps
Global Query-based Module for feature interaction
Hierarchical Supervision Strategy reduces resource usage
🔎 Similar Papers
No similar papers found.
X
Xu Zhao
State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Pengju Zhang
Pengju Zhang
University of Bristol
AIBioinformaticsStatistical PhysicsFinancial Technology
B
Bo Liu
State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Y
Yihong Wu
State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China