🤖 AI Summary
In open-vocabulary 3D scene understanding, inconsistent cross-view segmentation granularity from SAM—e.g., a "coffee set" fragmented into "cup + coffee + spoon" across views—causes semantic fragmentation in 3D Gaussian Splatting (3DGS) object representations. To address this, we propose the first 3DGS semantic enhancement framework integrating spatial context modeling. Our method (1) constructs a local graph structure to aggregate spatial-semantic features from neighboring Gaussians via message passing, and (2) introduces mask-centric contrastive learning to align fine-grained cross-view segmentation masks with coarse-grained semantic centers. Crucially, it avoids isolated per-Gaussian feature learning, substantially improving semantic consistency. Evaluated on LERF-OVS and ScanNet, our approach achieves significant gains in instance segmentation mAP and reduces fragmentation errors by 42.6%, marking the first demonstration of efficient, robust language-driven 3D understanding in large-scale scenes.
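The graph-based message passing described above can be sketched in a few lines. The following is a minimal illustrative re-implementation, not the paper's code: neighbor indices for each Gaussian center are precomputed once (the summary notes this avoids per-iteration neighbor searches), and a single aggregation round blends each Gaussian's semantic feature with the mean of its neighbors'. The function names, the blend weight `alpha`, and the brute-force k-NN are all assumptions for illustration.

```python
import numpy as np

def precompute_neighbors(positions, k=4):
    """Brute-force k-nearest-neighbor indices over Gaussian centers.

    Illustrative sketch: a real large-scale scene would use a KD-tree
    or voxel hashing, but the precompute-once idea is the same.
    """
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a Gaussian is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]    # (N, k) neighbor indices

def propagate(features, neighbors, alpha=0.5):
    """One message-passing round: mix each feature with its
    neighborhood average, then re-normalize to unit length."""
    agg = features[neighbors].mean(axis=1)           # (N, D) neighbor mean
    out = (1.0 - alpha) * features + alpha * agg     # hypothetical blend rule
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Averaging over spatial neighbors is what suppresses the per-view granularity noise: a Gaussian that one view assigns to "spoon" inherits context from surrounding "coffee set" Gaussians.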
📝 Abstract
Open-vocabulary 3D scene understanding is crucial for applications requiring natural-language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue, stemming from 2D segmentation methods like SAM, results in inconsistent object segmentations across views (e.g., a "coffee set" segmented as a single entity in one view but as "cup + coffee + spoon" in another). Existing 3DGS-based methods often rely on isolated per-Gaussian feature learning and neglect the spatial context needed for cohesive object reasoning, yielding fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS (1) constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity; (2) employs mask-centric contrastive learning to smooth SAM-derived features across views; and (3) precomputes neighborhood relationships, reducing computational cost and enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
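The mask-centric contrastive objective can also be sketched concretely. Below is a hedged, InfoNCE-style interpretation, not the paper's exact loss: each Gaussian's feature is pulled toward the mean feature ("center") of the SAM mask it belongs to and pushed away from the centers of other masks. The function name, the temperature `tau`, and the choice of cosine similarity are assumptions for illustration.

```python
import numpy as np

def mask_contrastive_loss(features, mask_ids, tau=0.1):
    """Hypothetical mask-centric contrastive loss.

    features : (N, D) per-Gaussian semantic features
    mask_ids : (N,)  integer mask assignment per Gaussian
    Pulls each feature toward its own mask center (positive) and away
    from all other mask centers (negatives), softmax-normalized.
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = np.unique(mask_ids)
    centers = np.stack([features[mask_ids == m].mean(axis=0) for m in labels])
    centers /= np.linalg.norm(centers, axis=1, keepdims=True)
    logits = features @ centers.T / tau                # (N, M) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = np.searchsorted(labels, mask_ids)            # column of own center
    return -log_probs[np.arange(len(features)), pos].mean()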