TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

📅 2024-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large vision-language-action (VLA) models struggle to model the spatial-temporal dynamics of interactive robotics, limiting their effectiveness on complex manipulation tasks. To address this, the paper introduces visual trace prompting, a simple yet effective technique that encodes state-action history as compact visual traces drawn directly on the observation image, explicitly strengthening a VLA model's spatial-temporal reasoning. The resulting TraceVLA model is obtained by finetuning OpenVLA on a self-collected dataset of 150K robot manipulation trajectories annotated with visual traces. Across SimplerEnv's 137 task configurations, TraceVLA outperforms OpenVLA by 10% in success rate, and on 4 physical WidowX robot tasks it improves success by 3.5x, while generalizing robustly across embodiments and scenarios. A compact variant built on the 4B Phi-3-Vision backbone, pretrained on Open X-Embodiment and finetuned on the same data, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
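
The overlay itself is easy to picture: recent 2D keypoint positions (the paper tracks active points in the image) are drawn onto the current frame as a polyline before the frame goes to the policy. The sketch below is a minimal illustration under that assumption; the function name, color scheme, and fading strategy are ours, not the paper's.

```python
import numpy as np
import cv2  # pip install opencv-python

def render_visual_trace(frame: np.ndarray,
                        trace: list[tuple[int, int]],
                        max_len: int = 16) -> np.ndarray:
    """Overlay a keypoint trajectory on an RGB frame as a visual trace.

    `trace` holds (x, y) pixel coordinates ordered oldest -> newest.
    Older segments are drawn dimmer so temporal order stays readable.
    """
    out = frame.copy()
    pts = trace[-max_len:]
    for i in range(1, len(pts)):
        # Fade the color from dim (old) to bright (recent) to encode time.
        alpha = i / max(len(pts) - 1, 1)
        color = (0, int(255 * alpha), int(255 * (1 - alpha)))  # BGR
        cv2.line(out, pts[i - 1], pts[i], color, thickness=2)
    if pts:
        cv2.circle(out, pts[-1], radius=4, color=(0, 255, 0), thickness=-1)
    return out

# Example: draw a short trace on a blank 224x224 frame.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
prompted = render_visual_trace(frame, [(40, 200), (80, 160), (120, 130), (160, 110)])
```

A frame prompted this way can then be passed to the policy alongside the task instruction, letting the network see where things have moved without any architectural change.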

📝 Abstract
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective at handling complex tasks such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks, and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on Open X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.
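
To make the data flow concrete, here is a hedged sketch of where such a prompt could sit in a closed control loop: track a point each step, keep a rolling history, render the trace, and query the policy. Every name here (DummyEnv, track_point, StubPolicy, predict_action) is an illustrative placeholder of ours, not an interface from TraceVLA or OpenVLA; it reuses render_visual_trace from the sketch above.

```python
from collections import deque
import numpy as np

class DummyEnv:
    """Stand-in environment: blank frames plus a fixed initial keypoint."""
    def reset(self):
        return np.zeros((224, 224, 3), dtype=np.uint8), (112, 112)
    def step(self, action):
        return np.zeros((224, 224, 3), dtype=np.uint8), False  # (next frame, done)

def track_point(frame, xy):
    # Stub: a real system would run a point tracker to follow `xy` across frames.
    return xy

class StubPolicy:
    def predict_action(self, image, instruction):
        return np.zeros(7)  # e.g., a 7-DoF action (delta pose + gripper)

def rollout(env, policy, instruction, horizon=50):
    frame, xy = env.reset()
    trace = deque(maxlen=16)  # rolling window of recent keypoint positions
    for _ in range(horizon):
        xy = track_point(frame, xy)
        trace.append(xy)
        prompted = render_visual_trace(frame, list(trace))  # from the sketch above
        action = policy.predict_action(prompted, instruction)
        frame, done = env.step(action)
        if done:
            break

rollout(DummyEnv(), StubPolicy(), "pick up the carrot")
```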
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action (VLA) Models
Spatial-Temporal Understanding
Robot Interaction Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Trace Prompting
Enhanced Spatial-Temporal Understanding
Improved Inference Efficiency in Robotic Tasks
🔎 Similar Papers
No similar papers found.