🤖 AI Summary
To address two key challenges in vision-only 3D occupancy estimation, namely inaccurate depth modeling and poor generalization due to sparse LiDAR supervision, this paper proposes a tightly integrated implicit-explicit 3D occupancy network. Methodologically: (1) it jointly leverages lift-based explicit depth prediction and projection-based implicit Transformers to enhance geometric consistency in 2D-to-3D view transformation; (2) it introduces a masked encoder-decoder architecture to improve fine-grained semantic discrimination; and (3) it incorporates context-aware self-supervised losses based on depth re-rendering and image reconstruction, enabling dense, annotation-free depth supervision. With a lightweight image backbone, the method achieves state-of-the-art performance on Occ3D-nuScenes at the lowest input resolution among comparable models, with a 3.3% absolute improvement in mean Intersection-over-Union (mIoU) over the baseline.
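The lift-based explicit depth prediction mentioned in point (1) follows the common "lift" idea: each pixel predicts a categorical distribution over candidate depth bins, and the outer product of that distribution with the pixel's context feature spreads feature mass along the camera ray. A minimal sketch of that step, with all shapes and names being illustrative assumptions rather than details from the paper:

```python
import numpy as np

def lift_features(depth_logits, context):
    """Lift image features into a camera frustum.

    depth_logits: (H, W, D) per-pixel logits over D depth bins.
    context:      (H, W, C) per-pixel context features.
    Returns frustum features of shape (H, W, D, C).
    """
    # Softmax over the D candidate depth bins for each pixel.
    e = np.exp(depth_logits - depth_logits.max(axis=-1, keepdims=True))
    depth_prob = e / e.sum(axis=-1, keepdims=True)
    # Outer product: weight each pixel's context feature by its
    # probability of lying in each depth bin.
    return depth_prob[..., :, None] * context[..., None, :]

# Toy sizes, purely for illustration.
H, W, D, C = 2, 3, 4, 5
frustum = lift_features(np.random.randn(H, W, D), np.random.randn(H, W, C))
print(frustum.shape)  # (2, 3, 4, 5)
```

Because the depth distribution sums to one per pixel, summing the frustum over the depth axis recovers the original context feature; the implicit Transformer branch then complements this explicit lift with projection-based feature sampling.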
📝 Abstract
3D occupancy perception plays a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still face two main challenges: modeling depth accurately in the 2D-to-3D view transformation stage, and overcoming the poor generalization caused by sparse LiDAR supervision. To address these issues, this paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach is threefold: 1) integration of explicit lift-based depth prediction and implicit projection-based transformers for depth modeling, enhancing the density and robustness of view transformation; 2) use of a mask-based encoder-decoder architecture for fine-grained semantic prediction; 3) adoption of context-aware self-training loss functions in the pretraining stage to complement LiDAR supervision, re-rendering 2D depth maps from 3D occupancy features and leveraging an image reconstruction loss to obtain denser depth supervision beyond the sparse LiDAR ground truth. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the lightest image backbone among current models, marking a 3.3% improvement attributable to our proposed contributions. Comprehensive experiments also demonstrate the consistent superiority of our method over baselines and alternative approaches.
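The re-rendering of 2D depth maps from 3D occupancy features described in point 3) can be illustrated with a volume-rendering-style depth composite: along each camera ray, per-sample occupancy values are alpha-composited into weights, and the expected depth is the weighted sum of sample depths. A minimal sketch under that assumption; the variable names and sampling scheme are illustrative, not taken from the paper:

```python
import numpy as np

def render_depth(alphas, depths):
    """Render per-ray depth from sampled occupancy via alpha compositing.

    alphas: (N_rays, N_samples) occupancy probabilities in [0, 1],
            ordered near-to-far along each ray.
    depths: (N_samples,) depth of each sample along the ray.
    Returns expected depth per ray, shape (N_rays,).
    """
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=-1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=-1)
    weights = trans * alphas                 # per-sample compositing weights
    return (weights * depths).sum(axis=-1)   # expected depth per ray

alphas = np.array([[0.0, 1.0, 0.0],    # ray 1: solid surface at the 2nd sample
                   [0.5, 0.5, 0.5]])   # ray 2: semi-transparent medium
depths = np.array([1.0, 2.0, 3.0])
print(render_depth(alphas, depths))    # approximately [2.0, 1.375]
```

A rendered depth map like this can be supervised densely through an image reconstruction (photometric) loss in addition to the sparse LiDAR points, which is the role the self-training losses play here.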