Language Driven Occupancy Prediction

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing open-vocabulary occupancy (OVO) prediction suffers from coarse, sparse, and noisy voxel-to-text correspondences, caused by using image features as intermediates or by voxel-to-view projection, which leads to inaccurate supervision. To address this, the authors propose LOcc, a framework built around a semantic transitive labeling pipeline that propagates text labels from images to LiDAR point clouds and finally to voxels, establishing dense, fine-grained voxel-level language supervision. LOcc decouples geometric and linguistic representation by replacing the standard prediction head with a binary occupancy geometry head and a language feature head, enabling zero-shot open-vocabulary understanding. On Occ3D-nuScenes, LOcc achieves 20.29 mIoU using only 256×704 input resolution, outperforming state-of-the-art methods that rely on temporal fusion, higher-resolution inputs, or larger backbones, while substantially reducing annotation cost.

📝 Abstract
We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates, or through noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to mine the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our semantic transitive labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. Notably, even based on the simpler BEVDet model with an input resolution of 256 × 704, LOcc-BEVDet achieves an mIoU of 20.29, surpassing previous approaches that rely on temporal images, higher-resolution inputs, or larger backbone networks. The code for the proposed method is available at https://github.com/pkqbajng/LOcc.
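The abstract's dual-head design (a geometry head for binary occupancy plus a language head for per-voxel language features) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, random features, and threshold are placeholders, and the text embeddings stand in for whatever vision-language encoder (e.g. CLIP) produces them. Zero-shot classification then reduces to cosine similarity between voxel language features and text embeddings of an arbitrary vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not the paper's actual sizes).
n_voxels, feat_dim, n_classes = 8, 16, 3

# Geometry head output: one binary-occupancy logit per voxel.
occ_logits = rng.normal(size=n_voxels)
occupied = occ_logits > 0.0

# Language head output: one language feature vector per voxel.
voxel_feats = rng.normal(size=(n_voxels, feat_dim))

# Text embeddings for an arbitrary open vocabulary (placeholder values).
text_embeds = rng.normal(size=(n_classes, feat_dim))

def l2norm(x, axis=-1):
    """Normalize vectors to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between voxel features and text embeddings;
# each occupied voxel takes the best-matching vocabulary entry.
sim = l2norm(voxel_feats) @ l2norm(text_embeds).T
labels = sim.argmax(axis=1)
labels[~occupied] = -1  # -1 marks voxels the geometry head calls empty

print(labels.shape)  # (8,)
```

Because the vocabulary is supplied only at inference time as text embeddings, swapping in a different label set requires no retraining, which is the point of decoupling geometry from language.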
Problem

Research questions and friction points this paper is trying to address.

Generating dense 3D language occupancy ground truth
Improving voxel-to-text correspondence accuracy
Reducing reliance on human annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic transitive labeling for dense 3D language occupancy
Geometry and language heads replace original prediction heads
Transfers text labels from images to LiDAR to voxels
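The transfer of text labels from images through LiDAR points to voxels can be sketched as below. This is a simplified stand-in, not the paper's pipeline: the image-to-point projection step is omitted (each point is assumed to already carry a text-label id inherited from its image pixel), and the point-to-voxel step is shown as a simple majority vote with hypothetical grid parameters.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Assume each LiDAR point already carries a text-label id inherited from
# the image pixel it projects to (projection omitted for brevity).
points = rng.uniform(0.0, 4.0, size=(200, 3))   # x, y, z coordinates
point_labels = rng.integers(0, 3, size=200)     # per-point text-label ids

# Discretize points into a regular voxel grid (1.0 is a placeholder size).
voxel_size = 1.0
voxel_idx = np.floor(points / voxel_size).astype(int)

# Majority vote per voxel: the most frequent point label becomes that
# voxel's pseudo ground-truth text label.
votes = {}
for idx, lab in zip(map(tuple, voxel_idx), point_labels):
    votes.setdefault(idx, []).append(lab)
voxel_gt = {v: Counter(labs).most_common(1)[0][0] for v, labs in votes.items()}
```

The resulting `voxel_gt` dictionary maps each occupied voxel index to a text label, giving the dense voxel-to-text correspondences that supervise the language head without manual annotation.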
Zhu Yu
Zhejiang University
Bowen Pang
Noah's Ark Lab, Huawei
Lizhe Liu
Autonomous Driving Lab, Cainiao Network
Runmin Zhang
Zhejiang University
Qihao Peng
Autonomous Driving Lab, Cainiao Network
Maochun Luo
Autonomous Driving Lab, Cainiao Network
Sheng Yang
Autonomous Driving Lab, Cainiao Network
Mingxia Chen
Autonomous Driving Lab, Cainiao Network
Sixi Cao
Zhejiang University
Hui Shen
Zhejiang University