AI Summary
To address the challenges of LiDAR data sparsity, severe occlusion, coarse semantic granularity, and heavy reliance on manual annotation in 3D point cloud labeling, this paper proposes the first multimodal open-world automatic annotation framework integrating vision-language models (VLMs). The method requires neither ground-truth labels nor high-definition maps; instead, it leverages LiDAR-camera cross-modal alignment and VLM-driven open-vocabulary semantic understanding to enable fine-grained object discovery and class-incremental learning, and it further incorporates point-cloud-specific detection optimization and self-supervised pseudo-label refinement. Evaluated on the nuScenes dataset, the approach achieves an object discovery AP of 52.95% and a multi-class 3D detection AP of up to 46.54%, significantly improving labeling efficiency and generalization for large-scale point cloud annotation.
Abstract
Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
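To make the LiDAR-camera cross-modal alignment step concrete, the following is a minimal sketch (not VESPA's actual implementation) of the standard pinhole projection commonly used for this purpose: each LiDAR point is transformed into the camera frame and projected to pixel coordinates, after which in-image points could inherit 2D semantic labels predicted by a VLM. The function name, the toy intrinsics `K`, and the extrinsics `T_cam_lidar` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K, img_w, img_h):
    """Project Nx3 LiDAR points into pixel coordinates.

    Returns (uv, mask): Nx2 pixel coordinates and a boolean mask of
    points that land inside the image and in front of the camera.
    """
    # Homogenize and move points into the camera frame (illustrative extrinsics).
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # N x 4
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                       # N x 3, camera frame

    in_front = cam[:, 2] > 1e-6                                  # discard points behind the camera

    # Pinhole projection with perspective divide.
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]

    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv, in_front & in_img

# Toy example: identity extrinsics, simple intrinsics with principal
# point at the image center of a 640x480 frame.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)
pts = np.array([[0.0, 0.0, 10.0],    # straight ahead: projects to the image center
                [0.0, 0.0, -5.0]])   # behind the camera: masked out
uv, mask = project_lidar_to_image(pts, T_cam_lidar, K, 640, 480)
```

In a pipeline of this kind, the mask selects which 3D points can be paired with image-space semantics; points outside every camera frustum would fall back to geometry-only processing.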