OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds

📅 2025-09-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization, reliance on multi-view image alignment, and dependence on manual annotations in open-vocabulary semantic segmentation of large-scale urban point clouds, this paper proposes the first end-to-end zero-shot 3D segmentation framework that requires no aligned imagery, no pre-trained point cloud segmentation networks, and no human annotations. Methodologically, it integrates multi-granularity point cloud rendering, mask-level vision-language feature extraction, sample-balanced fusion, and cross-view knowledge distillation to achieve language-guided 3D scene understanding without manual supervision. Evaluated on urban-scale benchmarks, including SensatUrban and SUM, the framework significantly outperforms prior methods in both segmentation accuracy and cross-domain generalization. To our knowledge, this is the first work to realize fully open-vocabulary, zero-shot, alignment-free semantic segmentation on real-world city-scale point clouds, establishing a scalable foundational technique for digital twin construction and intelligent urban management.
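The fusion step described above can be sketched in a minimal form: points are projected into multiple rendered views, each point inherits the vision-language embedding of the 2D mask it falls inside, and the per-view features are averaged into a single per-point feature. This is an illustrative simplification (the function name, input shapes, and plain averaging are assumptions, not the paper's exact sample-balanced scheme):

```python
import numpy as np

def fuse_mask_features(point_mask_ids, mask_feats, num_points, dim):
    """Fuse mask-level 2D features into per-point 3D features.

    point_mask_ids: list over views; each entry is an int array of length
        num_points giving the mask index the point projects into (-1 if
        the point is not visible in that view).
    mask_feats: list over views; each entry is a (num_masks, dim) array of
        vision-language embeddings for that view's masks.
    Returns a (num_points, dim) array of averaged, L2-normalized features.
    """
    acc = np.zeros((num_points, dim))
    cnt = np.zeros(num_points)
    for ids, feats in zip(point_mask_ids, mask_feats):
        visible = ids >= 0
        acc[visible] += feats[ids[visible]]
        cnt[visible] += 1
    fused = acc / np.maximum(cnt, 1)[:, None]   # average over views seen
    norms = np.linalg.norm(fused, axis=1, keepdims=True)
    return fused / np.maximum(norms, 1e-8)      # unit-normalize, guard zeros
```

In the actual framework these fused features would then serve as the distillation target for the 3D backbone, so that inference no longer needs any rendering or 2D model.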

📝 Abstract
Open-vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine-grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large-scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets and the poor generalization of existing three-dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that operates without aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, followed by distillation into a 3D backbone model. This design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Extensive experiments on large-scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross-scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.
Problem

Research questions and friction points this paper is trying to address.

Open-vocabulary semantic segmentation without aligned multi-view images
Poor generalization of 3D segmentation across diverse urban environments
Annotation-free recognition of arbitrary object categories in urban point clouds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view multi-granularity rendering for feature generation
Mask-level vision-language feature extraction technique
Sample-balanced fusion with 3D backbone distillation
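After distillation, open-vocabulary queries reduce to comparing each point's feature against text embeddings of arbitrary category prompts. The following sketch assumes CLIP-style unit-normalized embeddings on both sides; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def zero_shot_segment(point_feats, text_feats):
    """Label each point with its most similar text prompt.

    point_feats: (N, D) distilled per-point features, unit-normalized.
    text_feats: (C, D) embeddings of C category prompts, unit-normalized.
    Returns an (N,) array of integer class indices.
    """
    sims = point_feats @ text_feats.T  # cosine similarity for unit vectors
    return sims.argmax(axis=1)
```

Because the label set exists only as text at query time, new or fine-grained categories can be segmented simply by encoding new prompts, with no retraining of the 3D backbone.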
Kunlei Jing
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China
Jihua Zhu
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China
Di Wang
School of Software Engineering, Xi’an Jiaotong University, 710049 Xi’an, China; Shaanxi Joint (Key) Laboratory for Artificial Intelligence (Xi’an Jiaotong University), Xi’an 710049, China