Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

📅 2026-03-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well vision-language models (VLMs) reason spatially about robot motion, specifically whether they can interpret spatial relations and user motion preferences such as object proximity and path style. The authors evaluate four state-of-the-art VLMs, including Qwen2.5-VL and GPT-4o, using four querying methods. With the highest-performing querying method, Qwen2.5-VL reaches 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, while GPT-4o performs worse. The authors also analyze the trade-off between accuracy and computation cost, measured in number of tokens. The findings show some promise for integrating VLMs into robot motion planning pipelines.

📝 Abstract

Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear to what degree they have the spatial reasoning capability required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, while GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost, measured in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

vision-language models

robot motion planning

motion preferences

user instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Spatial Reasoning

Robot Motion Planning