ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

📅 2025-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak semantic generalization, insufficient commonsense reasoning, and poor occlusion robustness in open-vocabulary 3D visual grounding (problems that largely stem from reliance on fine-tuning with 3D annotations), this paper proposes an LVLM-guided hierarchical 3D Gaussian feature splatting framework. The method introduces (1) the first LVLM-driven explicit-implicit language co-localization mechanism; (2) a physically scaled adaptive Gaussian grouping strategy for cross-scale geometric-semantic alignment; and (3) ReasoningGD, the first large-scale occlusion-aware open-vocabulary 3D grounding dataset, with over 10K scenes and 2 million annotations. Evaluated on real-world occluded scenes, the approach substantially improves amodal localization accuracy and zero-shot category generalization, and it achieves clear gains over state-of-the-art methods on complex compositional reasoning tasks, demonstrating strong robustness and generalizability.

📝 Abstract
Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when those objects are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle the diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLMs) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
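The abstract describes a pipeline in which an LVLM first resolves an implicit instruction into an explicit object phrase, and a CLIP-style text embedding is then matched against fused multi-view CLIP features attached to scale-grouped 3D Gaussians. The sketch below illustrates only that matching loop under simplifying assumptions: GaussianGroup, embed_text, and lvlm_to_explicit_query are hypothetical stand-ins rather than the paper's actual API, and the CLIP and LVLM calls are mocked with deterministic random vectors so the script runs standalone.

```python
import hashlib

import numpy as np

EMB_DIM = 512  # typical CLIP embedding dimensionality


def _seed(text: str) -> int:
    # Stable per-string seed so the mock embeddings are reproducible.
    return int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)


def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)


class GaussianGroup:
    """A group of 3D Gaussians at one physical scale, carrying a fused
    multi-view CLIP feature (mocked here with a random unit vector)."""

    def __init__(self, name: str, scale_m: float):
        self.name = name
        self.scale_m = scale_m  # characteristic physical extent in meters
        rng = np.random.default_rng(_seed(name))
        self.feature = _unit(rng.standard_normal(EMB_DIM))


def embed_text(query: str) -> np.ndarray:
    """Stand-in for a CLIP text encoder (deterministic mock)."""
    return _unit(np.random.default_rng(_seed(query)).standard_normal(EMB_DIM))


def lvlm_to_explicit_query(implicit_instruction: str) -> str:
    """Stand-in for the LVLM step: resolve an implicit instruction
    ('something to sit on') into an explicit object phrase."""
    return "a chair"  # a real system would prompt an LVLM here


def ground(instruction: str, groups: list[GaussianGroup]) -> GaussianGroup:
    """Select the Gaussian group whose fused feature best matches the
    LVLM-resolved query, by cosine similarity of unit vectors."""
    q = embed_text(lvlm_to_explicit_query(instruction))
    return max(groups, key=lambda g: float(g.feature @ q))


if __name__ == "__main__":
    scene = [GaussianGroup("chair", 0.9),
             GaussianGroup("mug", 0.1),
             GaussianGroup("table", 1.5)]
    best = ground("something to sit on", scene)
    print(f"grounded to: {best.name} (~{best.scale_m} m)")
```

With real CLIP features, the argmax would land on the semantically matching group; here the mocked vectors only demonstrate the control flow.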
Problem

Research questions and friction points this paper is trying to address.

Localizing objects in 3D scenes using implicit language descriptions
Handling diverse semantics and common knowledge for effective reasoning
Improving 3D grounding accuracy in occluded and novel views
Innovation

Methods, ideas, or system contributions that make the work stand out.

LVLM-guided hierarchical 3D feature fields
3D Gaussian splatting for occluded objects
SAM masks and multi-view CLIP embeddings for scale-based Gaussian grouping (see the sketch below)
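To make the scale-based grouping idea concrete, here is a toy sketch that partitions Gaussian centers at several candidate physical scales and picks the level that best matches an expected object extent. The voxel-grid clustering, the scale values, and the names cluster_at_scale and pick_level are illustrative assumptions, not the paper's actual grouping algorithm.

```python
import numpy as np


def cluster_at_scale(centers: np.ndarray, cell_m: float) -> dict:
    """Partition 3D Gaussian centers into voxel cells of side cell_m meters."""
    keys = np.floor(centers / cell_m).astype(int)
    groups: dict[tuple, list[int]] = {}
    for idx, key in enumerate(map(tuple, keys)):
        groups.setdefault(key, []).append(idx)
    return groups


def pick_level(centers: np.ndarray,
               scales_m=(0.1, 0.4, 1.6),
               target_extent_m: float = 0.5):
    """Choose the hierarchy level whose cell size best matches the expected
    physical extent of the queried object, then group at that level."""
    best = min(scales_m, key=lambda s: abs(s - target_extent_m))
    return best, cluster_at_scale(centers, best)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(0.0, 3.0, size=(1000, 3))  # mock Gaussian centers
    level, groups = pick_level(centers, target_extent_m=0.5)
    print(f"selected scale {level} m -> {len(groups)} groups")
```

The design point this illustrates: a small queried object (a mug) should be resolved at a fine grouping level, while a large one (a table) should be resolved at a coarse level, so the grouping granularity adapts to physical scale rather than being fixed.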