Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited robustness on spatial language grounding tasks—particularly under referential ambiguity, deeply nested spatial relations (e.g., multi-layer prepositional structures), and negation. This work introduces the first systematic evaluation of VLMs’ spatial reasoning capabilities via referring expression comprehension (REC), covering core spatial semantics: topological, directional, and proximity relations. We benchmark both task-specific architectures and general-purpose large VLMs, revealing systematic modeling biases and generalization bottlenecks: most models struggle with long-range dependencies and logical negation, and consistently underperform on directional relations relative to topological ones. Our study establishes a reproducible, fine-grained evaluation benchmark for spatial semantics in VLMs and identifies concrete directions for improvement—namely, enhanced compositional reasoning over spatial modifiers and explicit handling of negation and directional constraints.

📝 Abstract
Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses rely on image captioning and visual question answering tasks. In this work, we instead propose the Referring Expression Comprehension (REC) task as a platform for evaluating spatial reasoning in VLMs. This platform enables a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all of these models face challenges with the task at hand, their relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
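Referring Expression Comprehension is typically scored by checking whether the model's predicted bounding box overlaps the annotated gold box at an intersection-over-union (IoU) of at least 0.5. The paper's exact metric is not specified in this summary, so the following is an illustrative sketch of that standard protocol, with boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predicted_boxes, gold_boxes, threshold=0.5):
    """Fraction of expressions whose predicted box matches the gold box at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predicted_boxes, gold_boxes))
    return hits / len(predicted_boxes)
```

Under this protocol, the fine-grained breakdowns the paper reports (by relation category, expression length, or presence of negation) amount to computing `rec_accuracy` over the corresponding subsets of the test expressions.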
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial reasoning difficulties in vision-language models
Analyzing spatial comprehension with ambiguous object detection and negation
Investigating model performance with complex spatial expressions and relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using Referring Expression Comprehension for spatial evaluation
Analyzing ambiguity, complex relations, and negation cases
Comparing the performance of task-specific and large vision-language models