DSM: Building A Diverse Semantic Map for 3D Visual Grounding

📅 2025-04-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D visual grounding methods over-rely on geometric and visual features while neglecting multidimensional semantic information—such as appearance, physical attributes, and functional affordances—leading to insufficient semantic understanding and limited generalization. To address this, we propose the Diversified Semantic Map (DSM), the first framework to introduce a geometric sliding-window construction strategy and a novel DSM-Grounding paradigm that explicitly models inter-object semantic relationships, moving beyond purely geometric or visual matching. Our method integrates multimodal vision-language models (VLMs), 3D scene graph modeling, sliding-window semantic mapping, semantic segmentation, and cross-modal alignment. On 3D visual grounding and semantic segmentation benchmarks, DSM achieves state-of-the-art performance. Real-world robotic experiments demonstrate substantial improvements in robustness and cross-scene generalization for navigation and grasping tasks.

Technology Category

Application Category

📝 Abstract
In recent years, with the growing research and application of multimodal large language models (VLMs) in robotics, there has been an increasing trend of utilizing VLMs for robotic scene understanding tasks. Existing approaches that use VLMs for 3D Visual Grounding tasks often focus on obtaining scene information through geometric and visual information, overlooking the extraction of diverse semantic information from the scene and the understanding of rich implicit semantic attributes, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation method for environmental perception and is an effective carrier for language models in 3D Visual Grounding tasks. To address these issues, we propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks. This method leverages VLMs to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. We enhance the understanding of grounding information based on DSM and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art. In addition, we have deployed this method on robots to validate its effectiveness in navigation and grasping tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D scene understanding with diverse semantic attributes
Improving 3D Visual Grounding via latent semantic extraction
Building a geometry-language fused map for robotic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages VLMs for diverse semantic attribute extraction
Uses geometry sliding-window for Diverse Semantic Map
Introduces DSM-Grounding for enhanced 3D Visual Grounding
🔎 Similar Papers
No similar papers found.
Qinghongbing Xie
Qinghongbing Xie
Tsinghua University
MLLMEmbodied AIScene Understanding
Z
Zijian Liang
School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou, 510641, China, intern at Tsinghua University
L
Long Zeng
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China