Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In monocular 3D visual grounding, pretrained language models are sensitive to the magnitude of numerical values but largely ignore the associated measurement units, reflecting weak 3D geometric understanding. To address this, we propose a dual-path enhancement framework: (1) a 3D-text augmentation preprocessing step that diversifies the units used in distance descriptions to enforce unit consistency; and (2) a text-guided geometric enhancement module that projects text features into a geometrically consistent space and uses them, via attention, to guide geometry features toward cross-modal alignment. Evaluated on the Mono3DRefer benchmark, our method improves long-range ("Far") localization accuracy by 11.94%, achieving new state-of-the-art performance. This work is the first to introduce unit-aware modeling into 3D visual grounding, strengthening joint language–geometry representation learning.

📝 Abstract
Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in "meters" to the equivalent value in "decimeters" or "centimeters" leads to severe performance degradation, even though the physical length remains unchanged. This observation reveals the weak 3D comprehension of pre-trained language models, which generate misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which improves the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Second, we propose a Text-Guided Geometry Enhancement (TGE) module to further enrich the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention over geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
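The 3DTE pre-processing described above can be illustrated with a small sketch: distance descriptors in a text query are rewritten into physically equivalent values in randomly chosen units, so the language model sees that "5 meters", "50 decimeters", and "500 centimeters" describe the same length. The unit set, conversion factors, and regex matching below are illustrative assumptions, not the paper's exact implementation.

```python
import random
import re

# Assumed unit table for the sketch; factors convert from meters.
UNIT_FACTORS = {"meters": 1.0, "decimeters": 10.0, "centimeters": 100.0}

def augment_distance_units(query: str, rng: random.Random) -> str:
    """Rewrite each '<number> meters' phrase into a randomly chosen equivalent unit."""
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*meters\b")

    def convert(match: re.Match) -> str:
        value = float(match.group(1))
        unit = rng.choice(list(UNIT_FACTORS))
        scaled = value * UNIT_FACTORS[unit]
        # "%g" drops trailing ".0" for integer-valued results, e.g. 50.0 -> "50".
        return f"{scaled:g} {unit}"

    return pattern.sub(convert, query)

rng = random.Random(0)
print(augment_distance_units("the car about 12.5 meters away on the left", rng))
```

At training time such rewrites would be sampled per query, giving the text encoder many unit-varied but physically equivalent descriptions of the same scene.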
Problem

Research questions and friction points this paper is trying to address.

Addresses weak 3D comprehension in pre-trained language models for visual grounding
Mitigates the sensitivity of text embeddings to numerical magnitude and units rather than to physical equivalence
Enhances 3D perception by aligning text features with geometric consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-text Enhancement for unit diversity
Text-guided geometry feature projection
Geometrically consistent attention mechanism
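The text-guided projection and attention listed above can be sketched as a single-head cross-attention step: text features are first projected into a space shared with geometry features, then used as queries over geometry tokens. The dimensions, single head, and random weights below are illustrative assumptions, not the paper's TGE module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_geometry_attention(text_feats, geom_feats, W_proj, W_k, W_v):
    """text_feats: (T, d_t); geom_feats: (G, d_g); returns (T, d) guided features."""
    q = text_feats @ W_proj                          # project text into a geometry-consistent space (T, d)
    k = geom_feats @ W_k                             # geometry keys   (G, d)
    v = geom_feats @ W_v                             # geometry values (G, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # text-to-geometry attention weights (T, G)
    return attn @ v                                  # text-guided geometry features (T, d)

rng = np.random.default_rng(0)
T, G, d_t, d_g, d = 4, 6, 32, 16, 24
out = text_guided_geometry_attention(
    rng.normal(size=(T, d_t)), rng.normal(size=(G, d_g)),
    rng.normal(size=(d_t, d)), rng.normal(size=(d_g, d)), rng.normal(size=(d_g, d)),
)
print(out.shape)  # (4, 24)
```

In a full model the projection and key/value maps would be learned, and the attended output would be fused back into the geometry branch.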
Yuzhen Li
School of Artificial Intelligence and Robotics, Hunan University, Changsha, Hunan, China
Min Liu
School of Artificial Intelligence and Robotics, Hunan University, Changsha, Hunan, China
Yuan Bian
School of Artificial Intelligence and Robotics, Hunan University, Changsha, Hunan, China
Xueping Wang
Hunan Normal University
Zhaoyang Li
Ph.D student, University of Science and Technology of China
Gen Li
School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Yaonan Wang
School of Artificial Intelligence and Robotics, Hunan University, Changsha, Hunan, China