🤖 AI Summary
Existing vision-language-action (VLA) models often suffer from geometric drift, fragmented temporal cues, and unstable action generation in long-horizon tasks when the history frames or 4D features they condition on are not motion-consistent. This work proposes MotionVLA, a VLA architecture that uses geometrically consistent motion trajectories as its memory interface. By encoding a short past-only video window into compact, time-continuous trajectory-field tokens, MotionVLA replaces discrete frame histories with physically coherent motion evidence, which current visual tokens query and which is then fused back into the VLA stream. By emphasizing motion coherence over simply expanding spatiotemporal context, the approach improves long-horizon manipulation on simulation benchmarks and in preliminary real-robot rollouts, yielding smoother and more direct action sequences.
📝 Abstract
Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of independently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.
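
The query-and-recouple step described above can be pictured as cross-attention followed by a residual update. The sketch below is only an illustration under strong assumptions, not the authors' implementation: the function name `query_motion_memory`, the fixed `gate` scalar, the token shapes, and the absence of learned projections are all hypothetical, and the abstract does not specify how trajectory-field tokens are produced or how fusion is parameterized.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_motion_memory(visual_tokens, motion_tokens, gate=0.5):
    """Hypothetical fusion step (an assumption, not the paper's method).

    Each current visual token acts as a query over the motion-history
    tokens (used here as both keys and values, with no learned
    projections), and the retrieved vector is added back to the visual
    token as a gated residual.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in visual_tokens:
        # Scaled dot-product similarity between the query and every motion token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in motion_tokens]
        weights = softmax(scores)
        # Attention-weighted sum of motion tokens.
        retrieved = [sum(w * tok[d] for w, tok in zip(weights, motion_tokens))
                     for d in range(dim)]
        # Recouple the retrieved motion evidence into the visual stream.
        fused.append([q + gate * r for q, r in zip(query, retrieved)])
    return fused


# Toy data: 4 current visual tokens and 6 motion-history tokens, dimension 8.
rng = random.Random(0)
visual = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
motion = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
fused = query_motion_memory(visual, motion)
print(len(fused), len(fused[0]))
```

The output keeps the shape of the visual token sequence, so in this sketch the fused tokens could be passed to a downstream policy unchanged. A real system would presumably use learned query, key, and value projections and a learned gate, but those details are not given in the abstract.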