🤖 AI Summary
This work proposes the first embodied reasoning framework that integrates Vision-Language Models (VLMs) with 3D Gaussian Splatting (3DGS), addressing the limitations of existing methods in handling complex compositional language queries and supporting effective embodied exploration. By using VLMs to drive semantic understanding and active viewpoint adjustment, the framework generates novel viewpoints on demand and performs spatial reasoning tailored to intricate language instructions. The approach combines 3DGS, VLMs, relevant-image retrieval, and novel view synthesis, achieving significant performance gains over current state-of-the-art methods across multiple benchmarks. These results demonstrate the effectiveness of jointly harnessing VLMs and 3DGS for embodied intelligence tasks that require sophisticated language grounding and 3D scene interaction.
📝 Abstract
We present GaussExplorer, a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS). While prior approaches to language-embedded 3DGS have made meaningful progress in aligning text queries with Gaussian embeddings, they are generally optimized for relatively simple queries and struggle to interpret more complex, compositional ones. Alternative studies based on object-centric RGB-D structured memories provide spatial grounding but are constrained to fixed, pre-captured viewpoints. To address these issues, GaussExplorer introduces Vision-Language Models (VLMs) on top of 3DGS to enable question-driven exploration and reasoning within 3D scenes. We first identify the pre-captured images most correlated with the query, and then adjust their viewpoints to synthesize novel views that capture the relevant visual information more accurately, enabling better reasoning by VLMs. Experiments show that GaussExplorer outperforms existing methods on several benchmarks, demonstrating the effectiveness of integrating VLM-based reasoning with 3DGS for embodied tasks.
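The abstract outlines a three-stage pipeline: retrieve the pre-captured images most correlated with the query, adjust their camera poses and render novel views from the 3DGS scene, then pass the rendered views to a VLM for reasoning. The sketch below is a hypothetical illustration of that flow, not the authors' implementation: it assumes a CLIP-style text-image similarity for the retrieval step (using the real `open_clip` API), while the 3DGS renderer, the question-driven pose adjustment, and the VLM call are injected as placeholder callables (`render_fn`, `adjust_fn`, `vlm_fn`) because the summary does not specify how they are implemented.

```python
import torch
import open_clip
from PIL import Image


def retrieve_relevant_views(query, image_paths, poses, top_k=3, device="cpu"):
    """Rank pre-captured images by CLIP text-image similarity to the query.

    Assumption: the paper's 'most correlated' images are approximated here
    with an off-the-shelf CLIP model; the actual retrieval may differ.
    """
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model = model.to(device).eval()

    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([query]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        image_feats = []
        for path in image_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            feat = model.encode_image(img)
            image_feats.append(feat / feat.norm(dim=-1, keepdim=True))
        image_feats = torch.cat(image_feats, dim=0)

    # Cosine similarity between every pre-captured image and the query text.
    scores = (image_feats @ text_feat.T).squeeze(-1)
    top_idx = scores.topk(min(top_k, len(image_paths))).indices.tolist()
    return [(image_paths[i], poses[i]) for i in top_idx]


def explore_and_answer(query, gaussians, image_paths, poses,
                       render_fn, adjust_fn, vlm_fn, top_k=3):
    """Retrieve correlated views, re-render from adjusted poses, query the VLM.

    render_fn(gaussians, pose) -> image   : placeholder for the 3DGS renderer
    adjust_fn(pose, query)     -> pose    : placeholder viewpoint adjustment
    vlm_fn(query, images)      -> answer  : placeholder VLM inference call
    """
    candidates = retrieve_relevant_views(query, image_paths, poses, top_k)
    rendered_views = []
    for _, pose in candidates:
        new_pose = adjust_fn(pose, query)                 # question-driven adjustment
        rendered_views.append(render_fn(gaussians, new_pose))
    return vlm_fn(query, rendered_views)                  # VLM reasoning over views
```

Injecting the renderer, pose adjustment, and VLM as callables keeps the sketch self-contained while making clear that those components are stand-ins for parts of the method the abstract describes only at a high level.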