🤖 AI Summary
This work addresses the limitation of existing methods in jointly modeling multidimensional physical attributes (e.g., fragility, material) and spatial structure for vision-language understanding of object relationships, which is critical for task-oriented robotic manipulation planning. We propose the first VLM-driven end-to-end manipulation framework: (1) fine-tuning a vision-language model to extract object attributes; (2) constructing an explicit spatial relation tree from image captions; (3) performing semantic re-planning over the tree via a language model; and (4) generating action sequences through multimodal representation fusion. Experiments demonstrate significant improvements in spatial reasoning accuracy in complex scenes. On real-robot tasks, our approach achieves higher task completion rates and more physically plausible manipulations, validating the effectiveness of attribute-augmented, structured spatial representations.
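To make step (3) concrete, here is a minimal sketch of semantic re-planning over object attributes. The function name, the attribute schema, and the toy scene are illustrative assumptions; the actual framework delegates re-planning to a language model rather than the simple physical heuristic used here (fragile objects on top, heavy objects at the bottom).

```python
def replan_stack(objects):
    """Reorder (name, attributes) pairs into a physically plausible
    stacking order: heavy, sturdy objects first (bottom), fragile last (top).
    A heuristic stand-in for the language-model re-planning step."""
    def key(item):
        name, attrs = item
        # Fragile items sort last; among the rest, heavier items sort first.
        return (attrs.get("fragility", "low") == "high", -attrs.get("mass", 0.0))
    return [name for name, _ in sorted(objects, key=key)]

# Hypothetical scene with attributes of the kind the fine-tuned VLM assigns.
scene = [
    ("glass", {"fragility": "high", "mass": 0.2}),
    ("book",  {"fragility": "low",  "mass": 0.5}),
    ("box",   {"fragility": "low",  "mass": 2.0}),
]
print(replan_stack(scene))  # ['box', 'book', 'glass']
```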
📝 Abstract
Vision-Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, and to perform manipulation based on this multimodal understanding. However, understanding object attributes and spatial relationships is non-trivial, yet it is critical for robotic manipulation tasks. In this work, we present a new dataset focused on spatial relationships and attribute assignment, together with a novel method that uses VLMs to perform object manipulation from task-oriented, high-level input. In this dataset, the spatial relationships between objects are manually described as captions. Additionally, each object is labeled with multiple attributes, such as fragility, mass, material, and transparency, derived from a fine-tuned vision-language model. The object information embedded in the captions is automatically extracted and transformed into a data structure (here, a tree, for demonstration purposes) that captures the spatial relationships among the objects in each image. The tree structures, together with the object attributes, are then fed into a language model, which transforms them into a new tree structure specifying how the objects should be organized to accomplish a given high-level task. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively, significantly enhancing spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.
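The caption-to-tree step described above can be sketched as follows. This is an illustrative assumption of how the tree might be represented: the node class, the triple format, and the toy relations are hypothetical, and real captions would first be parsed by the fine-tuned VLM rather than hand-written as triples.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """One object in the scene, with physical attributes and the objects it supports."""
    name: str
    attributes: dict = field(default_factory=dict)   # e.g. {"fragility": "high"}
    children: list = field(default_factory=list)     # objects resting on this one

def build_relation_tree(relations, attributes):
    """Build a support tree from (object, "on", support) triples.

    `relations` stands in for the caption-extraction output; roots are
    objects that rest on nothing (e.g. the table)."""
    nodes, supported = {}, set()
    def get(name):
        if name not in nodes:
            nodes[name] = ObjectNode(name, attributes.get(name, {}))
        return nodes[name]
    for obj, _, support in relations:
        get(support).children.append(get(obj))
        supported.add(obj)
    return [n for name, n in nodes.items() if name not in supported]

def flatten(node):
    """Pre-order traversal: a naive stand-in for deriving an action order
    (place each support before the objects resting on it)."""
    order = [node.name]
    for child in node.children:
        order.extend(flatten(child))
    return order

# Toy relations, as might be extracted from the caption
# "the cup is on the book, and the book is on the table".
relations = [("cup", "on", "book"), ("book", "on", "table")]
attrs = {"cup": {"fragility": "high"}, "table": {"material": "wood"}}
roots = build_relation_tree(relations, attrs)
print([flatten(r) for r in roots])  # [['table', 'book', 'cup']]
```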