Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems

📅 2024-11-21

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 1

🤖 AI Summary

This paper addresses zero-shot 3D visual grounding: localizing objects in 3D scenes from natural language descriptions without task-specific supervised training. It proposes a framework, CSVG, that formalizes the task as a Constraint Satisfaction Problem (CSP). The method uses open-source large language models to parse the query, then encodes the spatial relations between the target object and its anchor objects as constraints, including negation and counting, so that a CSP solver can reason over all relevant objects jointly and ground both the target and the anchors. Unlike supervised methods, it is not tied to a closed vocabulary, and unlike earlier zero-shot approaches it reasons globally rather than modeling relations locally, which keeps the result interpretable. It improves over state-of-the-art zero-shot methods by +7.0% in Acc@0.5 on ScanRefer and by +11.2% on Nr3D. The code is publicly available.
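For readers unfamiliar with the metric, Acc@0.5 on ScanRefer is conventionally the fraction of queries whose predicted box overlaps the ground-truth box with an IoU of at least 0.5. The sketch below illustrates that definition under a simplifying assumption of axis-aligned boxes; the names `iou_3d` and `acc_at_iou` are hypothetical and are not taken from the CSVG code.

```python
from typing import Sequence, Tuple

# Assumed box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy data: the first prediction overlaps its ground truth, the second misses.
preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts = [(0, 0, 0, 2, 2, 1), (2, 2, 2, 3, 3, 3)]
score = acc_at_iou(preds, gts)
```

An improvement of +7.0% in this metric means seven more queries per hundred are grounded with a box that passes the 0.5 overlap test.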

📝 Abstract

3D visual grounding (3DVG) aims to locate objects in a 3D scene with natural language descriptions. Supervised methods have achieved decent accuracy, but have a closed vocabulary and limited language understanding ability. Zero-shot methods mostly utilize large language models (LLMs) to handle natural language descriptions, yet suffer from slow inference speed. To address these problems, in this work, we propose a zero-shot method that reformulates the 3DVG task as a Constraint Satisfaction Problem (CSP), where the variables and constraints represent objects and their spatial relations, respectively. This allows global reasoning over all relevant objects, producing grounding results for both the target and anchor objects. Moreover, we demonstrate the flexibility of our framework by handling negation- and counting-based queries with only minor extra coding efforts. Our system, Constraint Satisfaction Visual Grounding (CSVG), has been extensively evaluated on the public ScanRefer and Nr3D datasets using only open-source LLMs. Results show the effectiveness of CSVG and superior grounding accuracy over current state-of-the-art zero-shot 3DVG methods, with improvements of $+7.0\%$ (Acc@0.5 score) and $+11.2\%$ on the ScanRefer and Nr3D datasets, respectively. The code of our system is publicly available at https://github.com/sunsleaf/CSVG.
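To make the formulation concrete, here is a minimal sketch of grounding as a CSP: each object mentioned in the query becomes a variable, its domain is the set of scene objects with a matching label, and each spatial relation becomes a binary constraint. This is an illustration only, not the CSVG implementation. The `Obj` type, the `near` and `below` predicates with their distance thresholds, the toy scene, and the plain backtracking solver are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import dist
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Obj:
    oid: int
    label: str
    center: Tuple[float, float, float]


Constraint = Tuple[str, str, Callable[[Obj, Obj], bool]]


def near(a: Obj, b: Obj) -> bool:
    # Hypothetical threshold: centers closer than 1 m.
    return dist(a.center, b.center) < 1.0


def below(a: Obj, b: Obj) -> bool:
    # Hypothetical rule: a is lower than b and roughly underneath it.
    return a.center[2] < b.center[2] and dist(a.center[:2], b.center[:2]) < 0.5


def solve(domains: Dict[str, List[Obj]], constraints: List[Constraint]) -> List[Dict[str, Obj]]:
    """Enumerate all assignments of distinct objects that satisfy every constraint."""
    names = list(domains)
    solutions: List[Dict[str, Obj]] = []

    def backtrack(i: int, assignment: Dict[str, Obj]) -> None:
        if i == len(names):
            solutions.append(dict(assignment))
            return
        var = names[i]
        for cand in domains[var]:
            if cand in assignment.values():
                continue  # each variable must bind a different object
            assignment[var] = cand
            # Check only constraints whose two variables are already bound.
            if all(check(assignment[a], assignment[b])
                   for a, b, check in constraints
                   if a in assignment and b in assignment):
                backtrack(i + 1, assignment)
            del assignment[var]

    backtrack(0, {})
    return solutions


# Query: "the chair near the table that is below the lamp".
scene = [
    Obj(0, "chair", (0.0, 0.0, 0.0)),
    Obj(1, "chair", (5.0, 0.0, 0.0)),
    Obj(2, "table", (0.5, 0.0, 0.0)),
    Obj(3, "lamp", (0.5, 0.0, 1.5)),
    Obj(4, "table", (5.5, 0.0, 0.0)),
]
domains = {v: [o for o in scene if o.label == v] for v in ("chair", "table", "lamp")}
constraints: List[Constraint] = [("chair", "table", near), ("table", "lamp", below)]
solutions = solve(domains, constraints)
```

Both chairs are near some table, so the relation "near the table" alone is ambiguous. Only the joint assignment that also places the table below the lamp survives, and the solution binds the anchors (table and lamp) as well as the target chair.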

Problem

Research questions and friction points this paper is trying to address.

Zero-shot 3D visual grounding with natural language descriptions

Reformulating 3DVG as a Constraint Satisfaction Problem for global reasoning

Handling negation and counting queries with minimal extra coding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates 3DVG as a Constraint Satisfaction Problem

Uses only open-source LLMs to parse queries, with a CSP solver performing the global symbolic reasoning

Handles negation and counting with minimal coding
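The source does not describe how negation and counting are implemented, so the following is a hypothetical sketch of one way such queries can be expressed as filters over candidate objects. The function names, the `near` threshold, and the toy coordinates are assumptions for illustration.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def near(a: Point, b: Point, radius: float = 1.0) -> bool:
    # Hypothetical threshold: centers closer than `radius` meters.
    return dist(a, b) < radius


def not_near_any(cands: Dict[int, Point], others: Dict[int, Point]) -> List[int]:
    """Negation: keep candidates that are near none of the other objects."""
    return [i for i, c in cands.items()
            if not any(near(c, o) for o in others.values())]


def with_count(cands: Dict[int, Point], others: Dict[int, Point], n: int) -> List[int]:
    """Counting: keep candidates that have exactly n of the other objects nearby."""
    return [i for i, c in cands.items()
            if sum(near(c, o) for o in others.values()) == n]


# Toy scene: object id -> center.
chairs = {0: (0.0, 0.0, 0.0), 1: (0.8, 0.0, 0.0), 2: (5.0, 0.0, 0.0), 3: (9.0, 9.0, 0.0)}
tables = {10: (0.4, 0.0, 0.0), 11: (5.5, 0.0, 0.0)}

# "the chair that is not near any table"
lonely_chairs = not_near_any(chairs, tables)
# "the table with two chairs next to it"
two_chair_tables = with_count(tables, chairs, 2)
```

Each query type adds only a short predicate on top of an existing relation, which is consistent with the claim that these cases need little extra code.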

🔎 Similar Papers

Task-oriented Sequential Grounding and Navigation in 3D Scenes