AI Summary
This work addresses the challenge of leveraging vast unlabeled LiDAR data in autonomous driving while mitigating the high cost of manual annotation by proposing an unsupervised multimodal pseudo-labeling framework. By exploiting temporal geometric consistency across consecutive LiDAR scans, the method lifts and fuses semantic cues from 2D vision foundation models and textual prompts into 3D space. It introduces a geometry-prior-driven dynamic scene decomposition and iterative refinement mechanism, enabling joint 3D semantic segmentation, object detection, and point cloud densification within a unified framework. High-quality pseudo labels are generated through geometric-semantic consistency constraints, using temporally accumulated LiDAR maps as geometric priors combined with multimodal prompts. Evaluated on three datasets, the approach demonstrates strong generalization: without any human supervision, it reduces the mean absolute error (MAE) of depth prediction by 51.5% and 22.0% in the 80–150 m and 150–250 m ranges, respectively, using only a small fraction of the geometrically consistent, densified points.
Abstract
Unlabeled LiDAR logs in autonomous driving are a gold mine of dense 3D geometry hiding in plain sight, yet they are almost useless without human labels, a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method that relies on strong geometric priors learned from temporally accumulated LiDAR maps, along with a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from such inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80–150 m and 150–250 m ranges, respectively.
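The temporal accumulation of LiDAR sweeps that underpins the geometric prior can be sketched in a few lines: each sweep is transformed into a common world frame using the ego pose and concatenated into a denser map. This is only a minimal illustration of the general idea, not the paper's pipeline; the function name and signature are hypothetical.

```python
import numpy as np

def accumulate_sweeps(sweeps, poses):
    """Accumulate per-sweep point clouds into a single world-frame map.

    sweeps: list of (N_i, 3) arrays of LiDAR points in each sensor frame.
    poses:  list of (4, 4) homogeneous ego poses (sensor -> world), one per sweep.
    Returns an (sum N_i, 3) array of points in the shared world frame.
    Hypothetical helper illustrating temporal accumulation; a real pipeline
    would also handle pose estimation and moving-object filtering.
    """
    world_points = []
    for pts, T in zip(sweeps, poses):
        # Lift to homogeneous coordinates and apply the rigid transform.
        homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N, 4)
        world_points.append((homo @ T.T)[:, :3])
    return np.vstack(world_points)
```

Static structure (roads, buildings) stays geometrically consistent across the accumulated map, while moving objects leave inconsistent traces, which is the cue the abstract mentions for detecting dynamic objects.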