Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models struggle with moving objects because of the latency between perception and execution. This work proposes AHEAD, a framework that adds a lightweight motion-aware world model to a frozen VLA backbone. AHEAD combines velocity and acceleration conditioning derived from optical flow, task-relevant saliency masking, and an adaptive prediction horizon, all operating in the VLA's feature space, so the frozen action decoder acts on forecast future states without any fine-tuning of the main model. Across 20 simulated dynamic scenarios, AHEAD achieves success rates of 79% to 97%, compared with 31% to 58% for the strongest baseline. On a physical robot, it succeeds in 29/30 to 30/30 trials on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching, where every baseline scores 0/30.

📝 Abstract

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
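The predict-then-act loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned world model is replaced by constant-acceleration kinematics on a 2-D stand-in for each patch token, the saliency mask is a plain boolean, and uncertainty is assumed to grow by a fixed amount per step. The sketch also stops just before the threshold would be exceeded, which is one possible reading of the halting rule. All names (`Token`, `step`, `rollout`, `uncertainty_per_step`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec = Tuple[float, float]


@dataclass
class Token:
    pos: Vec       # 2-D stand-in for a patch token's latent state
    vel: Vec       # per-token velocity (in the paper, derived from optical flow)
    acc: Vec       # per-token acceleration
    salient: bool  # True if the saliency mask marks the token as task-relevant


def step(token: Token, dt: float) -> Token:
    """Advance one token by dt using constant-acceleration kinematics.

    Assumption: this replaces the learned world model, which the abstract
    does not specify in detail.
    """
    if not token.salient:
        return token  # masked-out tokens pass through unchanged
    (px, py), (vx, vy), (ax, ay) = token.pos, token.vel, token.acc
    new_pos = (px + vx * dt + 0.5 * ax * dt * dt,
               py + vy * dt + 0.5 * ay * dt * dt)
    new_vel = (vx + ax * dt, vy + ay * dt)
    return Token(new_pos, new_vel, token.acc, True)


def rollout(tokens: List[Token], dt: float, max_steps: int,
            uncertainty_per_step: float, threshold: float
            ) -> Tuple[List[Token], int]:
    """Roll tokens forward for an adaptive number of steps.

    Assumption: uncertainty grows linearly with the horizon. The loop halts
    before taking a step that would push uncertainty above the threshold.
    """
    uncertainty = 0.0
    steps = 0
    while steps < max_steps:
        next_uncertainty = uncertainty + uncertainty_per_step
        if next_uncertainty > threshold:
            break
        tokens = [step(t, dt) for t in tokens]
        uncertainty = next_uncertainty
        steps += 1
    return tokens, steps


if __name__ == "__main__":
    scene = [
        Token((0.0, 0.0), (1.0, 0.0), (0.0, 0.0), salient=True),   # moving object
        Token((5.0, 5.0), (1.0, 1.0), (0.0, 0.0), salient=False),  # background
    ]
    future, horizon = rollout(scene, dt=0.1, max_steps=10,
                              uncertainty_per_step=0.1, threshold=0.35)
    # `future` is what a frozen action decoder would receive in place of `scene`.
    print(horizon, future[0].pos, future[1].pos)
```

In this toy setting, a lower threshold shortens the horizon, so the decoder sees a nearer and more reliable forecast. A higher threshold lets the rollout look further ahead at the cost of accumulated error.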

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

dynamic manipulation

latency

object motion

predictive modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent-space world model

predictive control

vision-language-action (VLA)