L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and memory footprint of semantic scene completion (SSC) in autonomous driving, this paper proposes L2COcc, a lightweight camera-centric framework that also accommodates LiDAR inputs. The method introduces three key components: (1) an efficient voxel transformer (EVT) that substantially reduces the complexity of 3D voxel modeling; (2) a cross-modal knowledge distillation scheme, comprising feature similarity distillation (FSD), TPV distillation (TPVD), and prediction alignment distillation (PAD), which transfers knowledge from a strong LiDAR teacher to the lightweight camera-based student; and (3) multimodal feature alignment coupled with a lightweight 3D encoding strategy. Evaluated on SemanticKITTI and SSCBench-KITTI-360, the approach achieves state-of-the-art accuracy among vision-based methods while reducing inference time and memory consumption by over 23%, striking a favorable balance between efficiency and performance.
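The two prediction-level distillation terms named above can be illustrated with a minimal NumPy sketch. The cosine-similarity form of FSD, the temperature-softened KL form of PAD, and all tensor shapes here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def fsd_loss(f_student, f_teacher, eps=1e-8):
    """Feature similarity distillation (assumed cosine form): pull the
    student's voxel features toward the LiDAR teacher's features."""
    fs = f_student / (np.linalg.norm(f_student, axis=-1, keepdims=True) + eps)
    ft = f_teacher / (np.linalg.norm(f_teacher, axis=-1, keepdims=True) + eps)
    # 1 - mean cosine similarity over all voxels
    return 1.0 - np.mean(np.sum(fs * ft, axis=-1))

def pad_loss(logits_student, logits_teacher, T=2.0, eps=1e-8):
    """Prediction alignment distillation (assumed KL form): match the
    student's softened class distribution to the teacher's."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_t = softmax(logits_teacher / T)
    p_s = softmax(logits_student / T)
    # KL(teacher || student), averaged over voxels
    return np.mean(np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1))

# toy example: 4 voxels, 16-dim features, 5 semantic classes
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(4, 16))
feat_s = feat_t + 0.1 * rng.normal(size=(4, 16))  # student close to teacher
log_t = rng.normal(size=(4, 5))
log_s = log_t.copy()

print(fsd_loss(feat_s, feat_t))  # small: features nearly aligned
print(pad_loss(log_s, log_t))    # 0.0 (identical predictions)
```

In practice both terms would be weighted and summed with the supervised SSC loss; the weighting scheme is not specified in this summary.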

📝 Abstract
Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge distillation modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduces computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods in accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, reducing both memory consumption and inference time by over 23% compared to the current state-of-the-art method. Code is available at our project page: https://studyingfufu.github.io/L2COcc/.
Problem

Research questions and friction points this paper is trying to address.

Improves Semantic Scene Completion accuracy for autonomous driving.
Reduces computational burden and memory usage in SSC frameworks.
Integrates LiDAR inputs with lightweight camera-centric SSC methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight camera-centric SSC framework
Efficient voxel transformer (EVT) implementation
Cross-modal knowledge distillation modules
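The TPV distillation bullet above builds on the tri-perspective-view idea: a dense voxel grid is collapsed onto three orthogonal feature planes, which is far cheaper than full 3D modeling. A hedged sketch follows; mean pooling as the projection and the exact grid size are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def voxels_to_tpv(vox):
    """Project a dense voxel feature grid (X, Y, Z, C) onto three
    orthogonal planes by mean pooling - a common tri-perspective-view
    (TPV) simplification; the paper's exact reduction may differ."""
    hw = vox.mean(axis=2)  # top-down plane (X, Y, C)
    hd = vox.mean(axis=1)  # front plane    (X, Z, C)
    wd = vox.mean(axis=0)  # side plane     (Y, Z, C)
    return hw, hd, wd

def tpv_memory_ratio(X, Y, Z):
    """Fraction of the dense grid's cell count kept by the three planes."""
    return (X * Y + X * Z + Y * Z) / (X * Y * Z)

# e.g. a 256 x 256 x 32 SemanticKITTI-sized grid
print(f"{tpv_memory_ratio(256, 256, 32):.3f}")  # 0.039: ~96% fewer cells
```

The memory ratio makes concrete why a TPV-style representation helps a lightweight student: the three planes carry a small fraction of the dense grid's cells while still covering every spatial axis.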
👥 Authors
Ruoyu Wang (Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China)
Yukai Ma (Zhejiang University)
Yi Yao (Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China)
Sheng Tao (Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China)
Haoang Li (Assistant Professor, Hong Kong University of Science and Technology (Guangzhou); Robotics, 3D Computer Vision)
Zongzhi Zhu (Zhejiang Guoli Security Technology Co., Ltd., Ningbo, China)
Yong Liu (State Key Laboratory of Industrial Control Technology)
Xingxing Zuo (Assistant Professor @MBZUAI; Robotics, State Estimation, Embodied AI)