🤖 AI Summary
Localizing natural language referring expressions (e.g., “the black car on the right”) in outdoor autonomous driving scenes is challenging due to high scene dynamics, frequent visual ambiguity among objects, and imprecise spatial descriptions.
Method: We propose a hybrid reasoning framework that integrates large language models (LLMs) and vision-language models (VLMs). It combines the VLMs' fine-grained attribute extraction with the LLMs' symbolic chain-of-thought reasoning and, as its key novelty, incorporates 3D spatial metadata into structured zero-shot prompts. Geometric priors are thus embedded explicitly in the multimodal reasoning pipeline, removing any reliance on end-to-end training.
Contribution/Results: On the Talk2Car benchmark, our approach markedly improves localization accuracy over standalone LLM and VLM baselines, validating the effectiveness of jointly leveraging spatial awareness and symbolic reasoning for robust referring-expression grounding in complex driving environments.
📝 Abstract
Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. Given an image and a free-form referring expression, LLM-RG uses an LLM to extract the relevant object types and attributes, detects candidate regions, generates rich visual descriptors with a VLM, and then combines these descriptors with spatial metadata into natural-language prompts over which an LLM performs chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM- and VLM-based baselines, and our ablations show that adding 3D spatial cues further improves grounding. These results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
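To make the pipeline stages concrete, the sketch below mirrors the abstract's flow in plain Python, with every model call stubbed out: the attribute parser stands in for the LLM query analysis, the descriptor step stands in for the VLM, and the final selector stands in for the LLM's chain-of-thought answer. All function names, the normalized box format, and the attribute schema are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the LLM-RG zero-shot pipeline. All model calls
# are replaced by deterministic stand-ins; names and data formats are
# assumptions for illustration only.

def extract_query_attributes(expression):
    # Stand-in for the LLM that parses the referring expression into a
    # target object type and attribute words.
    words = expression.lower().replace(",", "").split()
    colors = {"black", "white", "red"}
    return {
        "category": "car" if "car" in words else "object",
        "color": next((w for w in words if w in colors), None),
        "side": "right" if "right" in words else ("left" if "left" in words else None),
    }

def describe_candidates(candidates):
    # Stand-in for the VLM: attach a simple spatial descriptor (image-plane
    # side, derived from the box centre) to each detected candidate.
    described = []
    for c in candidates:
        x_centre = (c["box"][0] + c["box"][2]) / 2
        described.append({**c, "side": "right" if x_centre > 0.5 else "left"})
    return described

def build_prompt(expression, described):
    # Serialize descriptors plus spatial metadata into a natural-language
    # prompt, as the abstract describes.
    lines = [f'Referring expression: "{expression}"', "Candidates:"]
    for i, c in enumerate(described):
        lines.append(f"  {i}: a {c['color']} {c['category']} on the {c['side']}")
    lines.append("Which candidate index is the referent?")
    return "\n".join(lines)

def select_referent(query, described):
    # Stand-in for the LLM's chain-of-thought answer: pick the candidate
    # matching the most query attributes (None matches anything).
    def score(c):
        return sum(query.get(k) in (None, c.get(k)) for k in ("category", "color", "side"))
    best = max(range(len(described)), key=lambda i: score(described[i]))
    return described[best]["box"]

expression = "the black car on the right"
candidates = [  # normalized [x1, y1, x2, y2] boxes from an assumed detector
    {"category": "car", "color": "white", "box": [0.05, 0.4, 0.30, 0.7]},
    {"category": "car", "color": "black", "box": [0.60, 0.4, 0.90, 0.7]},
]
query = extract_query_attributes(expression)
described = describe_candidates(candidates)
prompt = build_prompt(expression, described)
box = select_referent(query, described)
print(box)  # -> [0.6, 0.4, 0.9, 0.7]
```

Keeping each stage as a separate function mirrors the modularity the abstract emphasizes: any stub can be swapped for a real detector, VLM, or LLM call without retraining anything, which is what makes the pipeline zero-shot.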