Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs

📅 2025-04-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D Gaussian Splatting (3DGS) methods rely on per-view 2D semantic feature optimization, resulting in low efficiency and cross-view semantic inconsistency. To address this, we propose a training-free 3D Gaussian rasterization framework for scene understanding. Our method introduces the first Gaussian-primitive-based Superpoint Graph construction paradigm, enabling view-consistent 3D region partitioning. We further design a graph-structured guidance mechanism for 2D→3D semantic reprojection, ensuring geometric and semantic alignment. Within a unified semantic field, the framework supports open-vocabulary understanding—from coarse to fine granularity—by integrating vision-language models (e.g., CLIP) with multi-view geometric constraints, without fine-tuning or iterative optimization. Experiments demonstrate state-of-the-art performance on open-vocabulary 3D segmentation, achieving a 30× speedup in semantic field reconstruction while significantly improving 3D semantic consistency and hierarchical representational capacity.

📝 Abstract
Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster. Our code will be available at https://github.com/Atrovast/THGS.
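The reprojection idea in the abstract can be sketched in miniature: given a rasterized pixel→superpoint ID map and a per-pixel 2D semantic feature map (e.g. from a CLIP-based encoder), each superpoint's feature is the mean of the features of the pixels it covers. This is a hedged toy sketch, not the paper's implementation; the function name, array shapes, and simple mean-pooling are assumptions for illustration.

```python
import numpy as np

def lift_features_to_superpoints(feat_map, sp_ids, num_superpoints):
    """Average per-pixel 2D features into per-superpoint features.

    feat_map: (H, W, D) per-pixel semantic features (e.g. CLIP-derived).
    sp_ids:   (H, W) int map assigning each pixel to a superpoint index
              (as produced by rasterizing superpoint IDs; -1 = no hit).
    Returns:  (num_superpoints, D) mean feature per superpoint.
    """
    D = feat_map.shape[-1]
    flat_feat = feat_map.reshape(-1, D)
    flat_ids = sp_ids.reshape(-1)
    valid = flat_ids >= 0  # ignore pixels not covered by any superpoint

    sums = np.zeros((num_superpoints, D))
    counts = np.zeros(num_superpoints)
    # Unbuffered scatter-add: accumulates correctly even with repeated IDs.
    np.add.at(sums, flat_ids[valid], flat_feat[valid])
    np.add.at(counts, flat_ids[valid], 1.0)

    counts = np.maximum(counts, 1.0)  # avoid division by zero for empty superpoints
    return sums / counts[:, None]
```

Because the accumulation is a single scatter-add pass over each view, no per-view gradient optimization is needed, which is the source of the claimed speedup over iterative feature-field training.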
Problem

Research questions and friction points this paper is trying to address.

Bridging natural language and 3D geometry for scene understanding
Achieving view-consistent 3D semantics without iterative optimization
Enabling efficient hierarchical open-vocabulary perception in 3D scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free superpoint graph from Gaussian primitives
Efficient reprojection for 3D semantic coherence
Hierarchical open-vocabulary understanding in unified field
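To make the first contribution concrete, here is a deliberately simplified stand-in for superpoint graph construction: Gaussian centers are grouped by a voxel grid into superpoints, and superpoints occupying face-adjacent voxels are connected by an edge. The paper's actual construction uses richer geometric and semantic affinities over Gaussian primitives; the voxel grouping, function name, and parameters below are illustrative assumptions only.

```python
import numpy as np

def build_superpoint_graph(centers, voxel_size=0.5):
    """Toy superpoint partition over Gaussian centers.

    centers: (N, 3) array of 3D Gaussian means.
    Returns (sp_ids, edges): per-Gaussian superpoint index, and a sorted
    list of undirected edges between face-adjacent superpoints.
    """
    # Quantize each center into a voxel; each occupied voxel = one superpoint.
    vox = np.floor(centers / voxel_size).astype(int)
    keys, sp_ids = np.unique(vox, axis=0, return_inverse=True)

    # Connect superpoints whose voxels differ by exactly one step on one axis.
    key_to_id = {tuple(k): i for i, k in enumerate(keys)}
    edges = set()
    for i, k in enumerate(keys):
        for off in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            j = key_to_id.get(tuple(k + np.array(off)))
            if j is not None:
                edges.add((min(i, j), max(i, j)))
    return sp_ids, sorted(edges)
```

The graph structure is what makes the partition view-consistent: superpoint membership is decided once in 3D, so every rendered view agrees on which region each Gaussian belongs to.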
Shaohui Dai
Graduate Student, Xiamen University
3D Scene Understanding
Yansong Qu
Purdue University-West Lafayette
Intelligent Transportation, Autonomous Driving
Zheyan Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Xinyang Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Shengchuan Zhang
Xiamen University
computer vision, machine learning
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China