🤖 AI Summary
This paper addresses two challenges in open-vocabulary semantic segmentation for 3D Gaussian Splatting: (i) language feature contamination caused by redundant background Gaussians and (ii) multi-view inconsistency induced by view-specific noise. It proposes a visibility-aware language feature fusion method with two components: (1) a ray-visibility-based gating mechanism that suppresses language features from low-contribution Gaussians, and (2) a streaming weighted geometric median in cosine space that improves the cross-view consistency of language features. The method is lightweight and training-free, requiring no auxiliary networks or additional supervision. Evaluated on multiple open-vocabulary localization and segmentation benchmarks, it consistently outperforms existing methods and yields robust, view-consistent language features in a fast and memory-efficient manner.
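To make the gating idea concrete, here is a minimal sketch of per-ray contributions and a visibility gate. It assumes the standard front-to-back alpha-compositing weights of 3D Gaussian Splatting, w_i = alpha_i * prod_{j<i} (1 - alpha_j), as the per-ray contribution. The function names and the fixed threshold `tau` are hypothetical and chosen for illustration; the summary does not specify the paper's actual gating rule.

```python
import numpy as np

def ray_contributions(alphas):
    """Blending weight of each Gaussian along one ray, sorted front to back.

    w_i = alpha_i * prod_{j<i} (1 - alpha_j)
    """
    alphas = np.asarray(alphas, dtype=np.float64)
    # Transmittance reaching Gaussian i: product of (1 - alpha) over all Gaussians in front of it.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return alphas * transmittance

def visibility_gate(weights, tau=0.1):
    """Hypothetical gate: keep only Gaussians whose contribution is at least tau."""
    return np.asarray(weights) >= tau

# One ray: a faint floater, a dominant surface, and an occluded background Gaussian.
alphas = [0.05, 0.9, 0.8]
weights = ray_contributions(alphas)
keep = visibility_gate(weights, tau=0.1)
```

In this toy ray only the second Gaussian passes the gate, so only it would receive the pixel's language feature. The background Gaussian has a high opacity but is mostly occluded, so its contribution is small.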
📝 Abstract
Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interaction with 3D scenes, we observe two fundamental issues: background Gaussians that contribute negligibly to a rendered pixel receive the same feature as the dominant foreground ones, and language embeddings are inconsistent across views because of view-specific noise. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
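The streaming fusion step can be illustrated with a rough one-pass sketch. It rests on several assumptions not stated in the abstract: features are normalized to unit length, distance is the chord length between unit vectors (a monotone function of cosine distance), and each new view is weighted Weiszfeld-style by its weight divided by its distance to the current estimate. The class name `StreamingCosineMedian` and the `eps` clamp are hypothetical, and the result is an order-dependent approximation rather than the exact geometric median or the paper's actual update rule.

```python
import numpy as np

class StreamingCosineMedian:
    """One-pass, Weiszfeld-style approximation of a weighted geometric median on the unit sphere."""

    def __init__(self, dim, eps=1e-6):
        self.numerator = np.zeros(dim)  # running weighted sum of unit features
        self.median = None              # current unit-length estimate
        self.eps = eps                  # clamp to avoid division by zero

    def update(self, feature, weight=1.0):
        f = np.asarray(feature, dtype=np.float64)
        f = f / np.linalg.norm(f)  # assumes a nonzero feature vector
        if self.median is None:
            scale = weight
        else:
            # Chord distance between unit vectors: sqrt(2 * (1 - cosine similarity)).
            dist = np.linalg.norm(self.median - f)
            # Views far from the current estimate are down-weighted.
            scale = weight / max(dist, self.eps)
        self.numerator += scale * f
        self.median = self.numerator / np.linalg.norm(self.numerator)
        return self.median

# One outlier view followed by five agreeing views, all with equal weight.
views = [[0.0, 1.0]] + [[1.0, 0.0]] * 5
agg = StreamingCosineMedian(dim=2)
for v in views:
    median = agg.update(v, weight=1.0)
```

Only a running sum and the current estimate are stored per Gaussian, so memory does not grow with the number of views. In this toy stream the estimate ends with a cosine similarity above 0.99 to the majority direction `[1, 0]`, whereas the normalized mean of the same six vectors reaches only about 0.981.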