🤖 AI Summary
Existing open-vocabulary 3D occupancy (OVO) prediction suffers from coarse, sparse, and noisy voxel-text correspondences—caused by image-feature mediation or voxel-to-view projection—leading to inaccurate supervision. To address this, we propose LOcc, a framework built around a semantic transitive labeling pipeline that propagates dense cross-modal alignment from image text labels to LiDAR point clouds and finally to voxels, establishing fine-grained voxel-level language supervision. LOcc decouples geometric and linguistic representations to support zero-shot open-vocabulary understanding. Technically, it integrates multimodal alignment, BEV feature encoding, voxelized language embeddings, a binary occupancy geometry head, and a language feature prediction head. On Occ3D-nuScenes, LOcc achieves 20.29 mIoU using only 256×704 input resolution—outperforming state-of-the-art methods that rely on temporal fusion, higher-resolution inputs, or large foundation models—while substantially reducing annotation cost.
📝 Abstract
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise networks through coarse voxel-to-text correspondences via image features as intermediates, or through noisy and sparse correspondences from voxel-based model-view projections. To alleviate this inaccurate supervision, we propose a semantic transitive labeling pipeline that generates dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to mine the valuable semantic information in images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, thereby establishing precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of a 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline produces more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotation. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. Notably, even built on the simpler BEVDet model with an input resolution of 256×704, LOcc-BEVDet achieves an mIoU of 20.29, surpassing previous approaches that rely on temporal images, higher-resolution inputs, or larger backbone networks. The code for the proposed method is available at https://github.com/pkqbajng/LOcc.
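The core idea of the labeling pipeline—transferring per-pixel text labels to LiDAR points via camera projection, then voting them into voxels—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `propagate_labels`, the pinhole projection setup, and the majority-vote voxel assignment are assumptions about the general technique.

```python
import numpy as np
from collections import Counter

def propagate_labels(points, labels_2d, K, T_cam_lidar, voxel_size=0.4):
    """Hypothetical sketch: transfer per-pixel text labels to LiDAR points,
    then assign each voxel the majority label of the points it contains.

    points: (N, 3) LiDAR points; labels_2d: (H, W) integer label map from an
    open-vocabulary 2D model; K: 3x3 intrinsics; T_cam_lidar: 4x4 extrinsics.
    """
    H, W = labels_2d.shape
    # Transform points into the camera frame and project with the pinhole model.
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    valid = cam[:, 2] > 0.1  # keep only points in front of the camera
    uv = (K @ cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    pts = points[valid][inside]
    pt_labels = labels_2d[uv[inside, 1], uv[inside, 0]]
    # Voxelize the labeled points and take a per-voxel majority vote.
    voxels = np.floor(pts / voxel_size).astype(int)
    votes = {}
    for v, lbl in zip(map(tuple, voxels), pt_labels):
        votes.setdefault(v, Counter())[lbl] += 1
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}
```

In practice the integer labels would index into a set of text embeddings (e.g. from a CLIP-style encoder), so the resulting voxel labels supervise the language head while binary occupancy supervises the geometry head.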