OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization, reliance on multi-view image alignment, and dependence on manual annotations in open-vocabulary semantic segmentation of large-scale urban point clouds, this paper proposes the first end-to-end zero-shot 3D segmentation framework that requires no aligned imagery, no pre-trained point cloud segmentation networks, and no human annotations. Methodologically, it integrates multi-granularity point cloud rendering, mask-level vision-language feature extraction, sample-balanced fusion, and cross-view knowledge distillation to achieve language-guided 3D scene understanding without manual supervision. Evaluated on urban-scale benchmarks, including SensatUrban and SUM, the framework significantly outperforms prior methods in both segmentation accuracy and cross-domain generalization. To our knowledge, this is the first work to realize fully open-vocabulary, zero-shot, alignment-free semantic segmentation on real-world city-scale point clouds, establishing a scalable foundational technique for digital twin construction and intelligent urban management.
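The fusion step described above can be sketched in a minimal form: points are projected into multiple rendered views, each point inherits the vision-language embedding of the 2D mask it falls inside, and the per-view features are averaged into a single per-point feature. This is an illustrative simplification (the function name, input shapes, and plain averaging are assumptions, not the paper's exact sample-balanced scheme):

```python
import numpy as np

def fuse_mask_features(point_mask_ids, mask_feats, num_points, dim):
    """Fuse mask-level 2D features into per-point 3D features.

    point_mask_ids: list over views; each entry is an int array of length
        num_points giving the mask index the point projects into (-1 if
        the point is not visible in that view).
    mask_feats: list over views; each entry is a (num_masks, dim) array of
        vision-language embeddings for that view's masks.
    Returns a (num_points, dim) array of averaged, L2-normalized features.
    """
    acc = np.zeros((num_points, dim))
    cnt = np.zeros(num_points)
    for ids, feats in zip(point_mask_ids, mask_feats):
        visible = ids >= 0
        acc[visible] += feats[ids[visible]]
        cnt[visible] += 1
    fused = acc / np.maximum(cnt, 1)[:, None]   # average over views seen
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.maximum(norms, 1e-8)      # unit-normalize, guard zeros
```

In the actual framework these fused features would then serve as the distillation target for the 3D backbone, so that inference no longer needs any rendering or 2D model.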

📝 Abstract
Open-vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine-grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large-scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets and the poor generalization of existing three-dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that operates without aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, followed by distillation into a 3D backbone model. This design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Extensive experiments on large-scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross-scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary semantic segmentation without aligned multi-view images
Poor generalization of 3D segmentation across diverse urban environments
Annotation-free recognition of arbitrary object categories in urban point clouds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view multi-granularity rendering for feature generation
Mask-level vision-language feature extraction technique
Sample-balanced fusion with 3D backbone distillation
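After distillation, open-vocabulary queries reduce to comparing each point's feature against text embeddings of arbitrary category prompts. The following sketch assumes CLIP-style unit-normalized embeddings on both sides; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def zero_shot_segment(point_feats, text_feats):
    """Label each point with its most similar text prompt.

    point_feats: (N, D) distilled per-point features, unit-normalized.
    text_feats: (C, D) embeddings of C category prompts, unit-normalized.
    Returns an (N,) array of integer class indices.
    """
    sims = point_feats @ text_feats.T  # cosine similarity for unit vectors
    return sims.argmax(axis=1)
```

Because the label set exists only as text at query time, new or fine-grained categories can be segmented simply by encoding new prompts, with no retraining of the 3D backbone.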
Kunlei Jing
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China
Jihua Zhu
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China
Di Wang
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China