3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited 3D visual understanding in real-world scenarios. Method: We propose 3DAxisPrompt—a novel visual prompting framework that jointly integrates 3D coordinate axes and SAM-generated masks to explicitly encode geometric priors into GPT-4o, thereby extending its 2D localization and reasoning capabilities to 3D space. This work constitutes the first systematic exploration of 3D visual prompting paradigms for MLLMs. Contribution/Results: We conduct comprehensive quantitative and qualitative evaluations across four benchmark 3D datasets—ScanRefer, ScanNet, FMB, and nuScenes—demonstrating substantial improvements in object-level 3D spatial perception. Our analysis reveals both the latent potential and inherent limitations of GPT-4o in 3D grounding and reasoning, uncovering task-specific efficacy of prompting strategies. These findings provide critical empirical evidence for future 3D prompt engineering and highlight the need for adaptive, task-aware prompting designs in 3D vision–language modeling.

📝 Abstract
Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extends their impressive 2D grounding and reasoning ability to real-world 3D scenarios. In addition, we provide the first thorough investigation of potential visual prompting formats and distill our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., the ScanRefer, ScanNet, FMB, and nuScenes datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, no single prompt engineering approach consistently achieves the best outcomes across all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.
Problem

Research questions and friction points this paper is trying to address.

How to elicit 3D grounding and reasoning from GPT-4o without retraining, via the 3DAxisPrompt visual prompt.
Whether MLLMs can reliably perceive 3D structure and object positions in real-world scenes.
How to evaluate 3D vision tasks across diverse datasets using prompt engineering techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces 3DAxisPrompt for 3D understanding in MLLMs.
Uses 3D coordinate axis and SAM masks for geometric priors.
Evaluates 3D tasks using multiple datasets and experiments.
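The paper does not ship code with this summary, but the geometric idea behind the axis part of 3DAxisPrompt can be sketched: place a 3D coordinate axis in the scene and project it onto the image so the MLLM sees explicit geometric cues. Below is a minimal, hypothetical illustration assuming a standard pinhole camera model with made-up intrinsics; it is not the authors' implementation.

```python
import numpy as np

def project_to_image(points_3d, K):
    """Project Nx3 camera-frame points onto the image plane (pinhole model)."""
    pts = np.asarray(points_3d, dtype=float)
    uv = (K @ pts.T).T             # Nx3 homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]  # perspective divide -> Nx2 pixel coords

# Hypothetical intrinsics: 500 px focal length, principal point (320, 240).
K = np.array([[500.,   0., 320.],
              [  0., 500., 240.],
              [  0.,   0.,   1.]])

# A coordinate axis anchored 2 m in front of the camera: the origin plus
# unit vectors along X, Y, Z. Drawing lines from the projected origin to
# each projected tip yields the axis overlay used as a visual prompt.
origin = np.array([0., 0., 2.])
axes_3d = np.stack([origin,
                    origin + [1., 0., 0.],   # X-axis tip
                    origin + [0., 1., 0.],   # Y-axis tip
                    origin + [0., 0., 1.]])  # Z-axis tip

axes_2d = project_to_image(axes_3d, K)
```

Each row of `axes_2d` is a pixel location; connecting the origin row to the three tip rows (e.g., with red/green/blue line segments) produces the on-image axis annotation that, together with SAM masks, gives GPT-4o an explicit 3D frame of reference.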