🤖 AI Summary
This work addresses the limitations of existing semantic 3D Gaussian splatting methods, which rely solely on 2D foundation models and struggle to preserve sharp semantic boundaries and structural consistency in urban scenes. We propose the first approach that integrates CityGML (Level-of-Detail 3) semantic priors into the 3D Gaussian splatting framework. Using a two-pass ray casting strategy, our method fuses 2D foundation-model predictions with structured urban models to generate image-aligned hierarchical semantic masks. Under joint 2D identity supervision and 3D spatial regularization, the model then learns compact, structure-aware identity embeddings for the Gaussian points. Evaluated on the TUM2TWIN and Gold Coast datasets, our approach significantly outperforms baselines such as LangSplat and Gaga, achieving up to a 15.8-point improvement in coarse-grained building segmentation IoU and a 14.2-point gain in fine-grained semantic mIoU, thereby enabling high-quality, instance-consistent urban semantic reconstruction.
📝 Abstract
Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level-of-Detail (LoD) 3 CityGML models via two-pass ray casting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.
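The core idea behind the two-pass masking can be sketched in a toy form. The snippet below is a minimal illustration, not the authors' implementation: it assumes an orthographic camera and collapses ray casting against LoD3 geometry into rasterizing axis-aligned footprints, and all identifiers and scene values are hypothetical. Pass 1 records which building each pixel "ray" hits; pass 2 accepts a facade-element hit only where the coarse mask agrees with the element's declared parent (the parent-child validation step described above).

```python
# Toy sketch of two-pass masking with parent-child validation.
# All geometry, IDs, and the orthographic-rasterization shortcut are
# illustrative assumptions, not GS4City's actual ray caster.

H, W = 10, 10

# Pass-1 geometry: building envelopes as (x0, x1, y0, y1, building_id).
buildings = [(2, 8, 2, 8, 1)]

# Pass-2 geometry: facade elements as (x0, x1, y0, y1, element_id, parent_id).
# The second element declares parent 2, which never appears in pass 1,
# so parent-child validation should reject it.
elements = [(3, 5, 3, 5, 10, 1),   # window, child of building 1
            (6, 7, 3, 4, 11, 2)]  # inconsistent parent -> rejected

coarse = [[0] * W for _ in range(H)]  # building id per pixel (0 = background)
fine = [[0] * W for _ in range(H)]    # facade-element id per pixel

# Pass 1: record which building each pixel ray hits.
for x0, x1, y0, y1, bid in buildings:
    for y in range(y0, y1):
        for x in range(x0, x1):
            coarse[y][x] = bid

# Pass 2: keep a facade-element hit only where the coarse mask matches
# the element's declared parent building.
for x0, x1, y0, y1, eid, parent in elements:
    for y in range(y0, y1):
        for x in range(x0, x1):
            if coarse[y][x] == parent:
                fine[y][x] = eid

hits_window = sum(row.count(10) for row in fine)
hits_bad = sum(row.count(11) for row in fine)
print(hits_window, hits_bad)  # → 4 0
```

The two masks together form the hierarchical label per pixel (building id from pass 1, element id from pass 2), which is what gets fused with foundation-model predictions downstream.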