AI Summary
To address the challenges of LiDAR data sparsity, severe occlusion, coarse semantic granularity, and heavy reliance on manual annotation in 3D point cloud labeling, this paper proposes the first multimodal open-world automatic annotation framework integrating vision-language models (VLMs). The method requires neither ground-truth labels nor high-definition maps; instead, it leverages LiDAR-camera cross-modal alignment and VLM-driven open-vocabulary semantic understanding to enable fine-grained object discovery and class-incremental learning, and it further incorporates point-cloud-specific detection optimization and self-supervised pseudo-label refinement. Evaluated on the nuScenes dataset, the approach achieves an object discovery AP of 52.95% and a multi-class 3D detection AP of up to 46.54%, significantly improving labeling efficiency and generalization for large-scale point cloud annotation.
Abstract
Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
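To make the LiDAR-camera cross-modal alignment step concrete, the following is a minimal sketch (not VESPA's actual implementation) of the standard pinhole projection commonly used for this purpose: each LiDAR point is transformed into the camera frame and projected to pixel coordinates, after which in-image points could inherit 2D semantic labels predicted by a VLM. The function name, the toy intrinsics `K`, and the extrinsics `T_cam_lidar` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K, img_w, img_h):
    """Project Nx3 LiDAR points into pixel coordinates.

    Returns (uv, mask): Nx2 pixel coordinates and a boolean mask of
    points that land inside the image and in front of the camera.
    """
    # Homogenize and move points into the camera frame (illustrative extrinsics).
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # N x 4
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                       # N x 3, camera frame

    in_front = cam[:, 2] > 1e-6                                  # discard points behind the camera

    # Pinhole projection with perspective divide.
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]

    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv, in_front & in_img

# Toy example: identity extrinsics, simple intrinsics with principal
# point at the image center of a 640x480 frame.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)
pts = np.array([[0.0, 0.0, 10.0],    # straight ahead: projects to the image center
                [0.0, 0.0, -5.0]])   # behind the camera: masked out
uv, mask = project_lidar_to_image(pts, T_cam_lidar, K, 640, 480)
```

In a pipeline of this kind, the mask selects which 3D points can be paired with image-space semantics; points outside every camera frustum would fall back to geometry-only processing.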