CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In open-vocabulary 3D scene understanding, inconsistent cross-view segmentation granularity from SAM—e.g., “coffee set” being fragmented into “cup + coffee + spoon” across views—causes semantic fragmentation in 3D Gaussian Splatting (3DGS) object representations. To address this, we propose the first 3DGS semantic enhancement framework integrating spatial context modeling. Our method: (1) constructs a local graph structure to aggregate spatial-semantic features from neighboring Gaussians via message passing; and (2) introduces mask-centered contrastive learning to align fine-grained cross-view segmentation masks with coarse-grained semantic centers. Crucially, it avoids isolated per-Gaussian feature learning, substantially improving semantic consistency. Evaluated on LERF-OVS and ScanNet, our approach achieves significant gains in instance segmentation mAP and reduces fragmentation errors by 42.6%, marking the first demonstration of efficient, robust language-driven 3D understanding in large-scale scenes.

Technology Category

Application Category

📝 Abstract
Open-vocabulary 3D scene understanding is crucial for applications requiring natural language-driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open-vocabulary frameworks reveals a key challenge: cross-view granularity inconsistency. This issue, stemming from 2D segmentation methods like SAM, results in inconsistent object segmentations across views (e.g., a"coffee set"segmented as a single entity in one view but as"cup + coffee + spoon"in another). Existing 3DGS-based methods often rely on isolated per-Gaussian feature learning, neglecting the spatial context needed for cohesive object reasoning, leading to fragmented representations. We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity, employs mask-centric contrastive learning to smooth SAM-derived features across views, and leverages a precomputation strategy to reduce computational cost by precomputing neighborhood relationships, enabling efficient training in large-scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet, enabling robust language-guided 3D scene understanding.
Problem

Research questions and friction points this paper is trying to address.

Addresses cross-view granularity inconsistency in 3D scene understanding
Improves object cohesion by integrating spatial context into 3DGS
Reduces fragmentation errors in open-vocabulary 3D instance segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local graphs propagate contextual features across Gaussians
Mask-centric contrastive learning smooths SAM-derived features
Precomputation strategy reduces computational cost efficiently
🔎 Similar Papers
No similar papers found.
W
Wei Sun
University of Chinese Academy of Sciences, Beijing, China
Yanzhao Zhou
Yanzhao Zhou
University of Chinese Academy of Sciences, Beijing, China
Jianbin Jiao
Jianbin Jiao
University of Chinese Academy of Sciences
Computer Vision
Y
Yuan Li
University of Chinese Academy of Sciences, Beijing, China