GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D vision-language models (VLMs) heavily rely on off-the-shelf 3D object detectors, introducing computational bottlenecks and limiting classification flexibility. To address this, we propose the first end-to-end 3D vision-language modeling framework built upon 3D Gaussian splatting: language features are directly embedded into Gaussian primitives to enable early cross-modal alignment; a dual sparsification mechanism—guided by both task semantics and spatial position—generates task-aware global-local scene tokens; and a multimodal feature fusion network enhances 3D semantic reasoning. Our approach eliminates detector dependency entirely, significantly improving open-domain generalization and fine-grained spatial-semantic alignment accuracy. Under cross-domain evaluation, it achieves five times the performance of current state-of-the-art 3D VLMs and robustly supports complex embodied reasoning tasks.
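The early cross-modal alignment described above can be sketched in a few lines: each Gaussian primitive carries a language feature vector, and a text query is scored against every primitive directly, with no object detector in the loop. This is a toy illustration under assumed shapes and random features, not the paper's implementation; all names (`align_query`, `lang_feats`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy scene: N Gaussian primitives, each with a 3D mean
# position and a D-dim language feature (e.g. lifted from a 2D
# vision-language encoder). Shapes and values are illustrative only.
N, D = 1000, 32
positions = rng.normal(size=(N, 3))
lang_feats = rng.normal(size=(N, D))
lang_feats /= np.linalg.norm(lang_feats, axis=1, keepdims=True)

def align_query(query_emb: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Early modality alignment: cosine similarity between each
    Gaussian's language feature and a text-query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    return feats @ q  # one relevance score per primitive, shape (N,)

query = rng.normal(size=D)
scores = align_query(query, lang_feats)
print(scores.shape)  # (1000,)
```

Because alignment happens per primitive, downstream modules can reason over language-grounded geometry without committing to a fixed object taxonomy.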

📝 Abstract
As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, and demonstrate strong generalization: it improves the performance of prior 3D VLMs fivefold in out-of-domain settings.
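The dual sparsifier's two pathways can be illustrated with a minimal sketch: a task-guided pathway keeps the primitives most relevant to the instruction (global tokens), and a location-guided pathway keeps those nearest a task-relevant 3D anchor (local tokens). All names, shapes, and the simple top-k selection here are assumptions for illustration, not the paper's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dense scene: N per-Gaussian features (D-dim) with 3D positions.
N, D, K = 1000, 32, 16
feats = rng.normal(size=(N, D))
positions = rng.normal(size=(N, 3))
task_emb = rng.normal(size=D)        # hypothetical instruction embedding
anchor = np.array([0.0, 0.0, 0.0])   # hypothetical task-relevant location

def task_guided(feats, task_emb, k):
    """Global pathway: keep the k primitives most similar to the task."""
    scores = feats @ (task_emb / np.linalg.norm(task_emb))
    idx = np.argsort(scores)[-k:]
    return feats[idx]

def location_guided(feats, positions, anchor, k):
    """Local pathway: keep the k primitives closest to the anchor point."""
    dists = np.linalg.norm(positions - anchor, axis=1)
    idx = np.argsort(dists)[:k]
    return feats[idx]

global_tokens = task_guided(feats, task_emb, K)
local_tokens = location_guided(feats, positions, anchor, K)
scene_tokens = np.concatenate([global_tokens, local_tokens], axis=0)
print(scene_tokens.shape)  # (32, 32)
```

Distilling thousands of dense primitives down to a few dozen tokens is what makes the representation tractable for a language model's context window.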
Problem

Research questions and friction points this paper is trying to address.

Reducing dependence on object detectors in 3D VLMs
Enhancing taxonomic flexibility in 3D scene understanding
Improving generalization in out-of-domain 3D VLM tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-aligned Gaussian splats for 3D scenes
Dual sparsifier for task-relevant token distillation
First Gaussian splatting-based VLM for generalization