AI Summary
This work addresses the poor performance of existing vision-language models in collaborative dialogue tasks that require spatial reasoning, where accurately interpreting and reconstructing visually grounded spatial structures remains challenging. The study introduces, for the first time, a multi-turn, multi-agent collaborative dialogue framework in which models use linguistic interaction to reconstruct target structures from visual and textual inputs. It systematically evaluates how different image representations and input modalities affect spatial reasoning. Experiments cover both open-weight and closed models, using decomposed image representations and fine-grained textual descriptions of the target structure. Results reveal significant limitations in current models' ability to perform spatial visual grounding: detailed text and decomposed image representations both improve reconstruction success rates, but overall performance remains constrained, which points to the limits of existing approaches on this task.
Abstract
Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.