Vision language models have difficulty recognizing virtual objects

📅 2025-05-15
🤖 AI Summary
This paper identifies a structural deficiency in vision-language models (VLMs) in implicit spatial-relation reasoning: when images contain only indirect cues (e.g., “a line beside a tree”) without explicitly depicting the implied object (e.g., “a kite stuck in the tree”), mainstream VLMs, including LLaVA and Qwen-VL, achieve accuracy below 35%, substantially underperforming human baselines. Method: the paper introduces a scene-understanding evaluation paradigm centered on *virtual objects* as prompts, establishing a systematic framework that integrates multimodal prompt engineering, controllable scene-description generation, and structured spatial-relation classification. Contribution/Results: the framework exposes VLMs’ failure to spatially represent unobserved yet inferable objects, revealing critical gaps in physical commonsense and causal reasoning, and provides a novel benchmark and methodological foundation for diagnosing and improving these capabilities.

📝 Abstract
Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with virtual object recognition
Testing scene comprehension via virtual objects
Current VLMs inadequately process spatial relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using virtual objects to test scene comprehension in VLMs
Evaluating spatial relations via virtual-object prompts
Systematic evaluations exposing state-of-the-art VLMs' inadequate processing of virtual objects
Tyler Tran
US Naval Research Laboratory, Washington, DC 20175 USA
Sangeet Khemlani
Navy Center for Applied Research in Artificial Intelligence, US Naval Research Laboratory
reasoning, computational cognitive science, cognitive modeling, explanation, causal reasoning
J. G. Trafton
US Naval Research Laboratory, Washington, DC 20175 USA