3D Vision-Language Gaussian Splatting

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Existing methods for multimodal 3D scene understanding struggle to balance the visual and linguistic modalities, leading to distorted semantic rasterization of transparent or refractive objects and over-fitting to color cues. To address this, we propose a vision-language collaborative Gaussian splatting framework. Our approach introduces a cross-modal rasterizer that jointly incorporates camera-view geometry and a smoothed semantic indicator, explicitly strengthening language-modality representation learning within Gaussian splatting, an aspect previously unexplored. By integrating 3D Gaussian splatting, cross-modal feature alignment, semantic-indicator optimization, and view-consistent synthesis, our method significantly improves open-vocabulary semantic segmentation, achieving state-of-the-art performance and notably better semantic reconstruction accuracy and generalization for transparent and refractive objects.

📝 Abstract
Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between visual and language modalities, which leads to unsatisfying semantic rasterization of translucent or reflective objects, as well as over-fitting on color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of language modality. We propose a novel cross-modal rasterizer, using modality fusion along with a smoothed semantic indicator for enhancing semantic rasterization. We also employ a camera-view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over-fitting. Extensive experiments demonstrate that our method achieves state-of-the-art performance in open-vocabulary semantic segmentation, surpassing existing methods by a significant margin.
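The abstract's central idea, rasterizing per-Gaussian semantic features with a smoothed indicator rather than a hard label, can be illustrated with the standard front-to-back alpha compositing used in Gaussian splatting. The sketch below is an illustrative reconstruction, not the paper's implementation: the function name, the temperature-softmax form of the "smoothed semantic indicator", and the array shapes are all assumptions.

```python
import numpy as np

def rasterize_semantics(alphas, sem_logits, tau=1.0):
    """Alpha-composite per-Gaussian semantics along a ray (illustrative sketch).

    alphas:     (N,) opacity of each depth-sorted Gaussian at this pixel
    sem_logits: (N, C) raw semantic logits per Gaussian
    tau:        softmax temperature; a smoothed indicator instead of a hard one-hot
    """
    # Smoothed semantic indicator: temperature softmax rather than argmax,
    # keeping the semantic channel differentiable during optimization.
    sem = np.exp(sem_logits / tau)
    sem /= sem.sum(axis=1, keepdims=True)

    # Standard front-to-back alpha compositing, mirroring color rasterization,
    # so translucent objects contribute partial semantic weight instead of
    # being overwritten by whatever lies behind them.
    out = np.zeros(sem.shape[1])
    transmittance = 1.0
    for a, s in zip(alphas, sem):
        out += transmittance * a * s
        transmittance *= (1.0 - a)
    return out
```

Because the indicator is soft, a half-transparent Gaussian in front blends its semantics with the surface behind it, which is exactly the regime (translucent/reflective objects) where hard per-pixel labels break down.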
Problem

Research questions and friction points this paper is trying to address.

Balancing visual and language modalities in 3D reconstruction
Improving semantic rasterization for translucent or reflective objects
Mitigating over-fitting on color modality in multi-modal 3D understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D vision-language Gaussian splatting model
Cross-modal rasterizer with modality fusion
Camera-view blending for semantic consistency
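The camera-view blending contribution above presumably synthesizes intermediate viewpoints between training cameras so a consistency loss can be applied across real and blended views. A minimal pose-blending sketch, assuming standard rotation interpolation via the axis-angle of the relative rotation (Rodrigues' formula) and linear interpolation of translations; the function name and interface are hypothetical, not taken from the paper:

```python
import numpy as np

def blend_camera_poses(R1, t1, R2, t2, w=0.5):
    """Blend two camera poses into a synthesized intermediate view.

    R1, R2: (3, 3) rotation matrices; t1, t2: (3,) translations;
    w in [0, 1] is the blend weight toward the second view.
    """
    # Relative rotation taking view 1 to view 2.
    R_rel = R2 @ R1.T
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-8:
        R_blend = R1  # Views already share orientation.
    else:
        # Rotation axis from the skew-symmetric part of R_rel.
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]]) / (2.0 * np.sin(angle))
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])
        # Rodrigues' formula for a fraction w of the relative rotation.
        R_frac = np.eye(3) + np.sin(w * angle) * K + (1.0 - np.cos(w * angle)) * (K @ K)
        R_blend = R_frac @ R1
    t_blend = (1.0 - w) * t1 + w * t2
    return R_blend, t_blend
```

Rendering the scene from such blended poses and penalizing semantic disagreement with the neighboring real views is one plausible way the method regularizes against over-fitting to the color cues of the training cameras.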