🤖 AI Summary
This work addresses a limitation of existing robotic world models: they support only open-loop prediction along pre-collected action trajectories, so they cannot evaluate vision-language-action (VLA) policies in closed-loop settings. To overcome this, the authors propose PiL-World, a chunk-wise world model that enables closed-loop VLA evaluation by alternating between policy inference and world-model prediction, generating multi-view future observations without real robot execution at every step. To improve rollout fidelity, it conditions video generation on action-derived visual control and latent histories, and it is trained on both successful demonstrations and failed execution trajectories so that imagined rollouts better match the distribution of real policy executions. Evaluated on three real dual-arm manipulation tasks, PiL-World reduces the error between real and estimated closed-loop success rates from 63.2% (baseline) to 12.0%, and its imagined rollouts closely align with real robot executions.
📝 Abstract
Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on two signals, action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. In addition to successful teleoperated demonstrations, it learns from failed execution trajectories, which helps the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks, where it generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
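
The alternation between VLA inference and world-model prediction described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `closed_loop_rollout`, `estimate_success_rate`, and `success_rate_error`, the callable interfaces, and the one-dimensional toy policy and world model are all hypothetical stand-ins for a real VLA policy and a multi-view video world model. The error is assumed here to be the absolute difference between the two success rates, which the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real system would use multi-view images and
# robot action chunks instead of tuples and lists of floats.
Observation = Tuple[float, ...]
ActionChunk = List[float]


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    initial_obs: Observation,
    is_success: Callable[[Observation], bool],
    max_chunks: int = 10,
) -> Tuple[bool, List[Observation]]:
    """Alternate policy inference and world-model prediction, chunk by chunk."""
    obs = initial_obs
    history = [obs]
    for _ in range(max_chunks):
        chunk = policy(obs)             # policy acts on the latest observation
        obs = world_model(obs, chunk)   # imagined observation replaces real execution
        history.append(obs)
        if is_success(obs):
            return True, history
    return False, history


def estimate_success_rate(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], Observation],
    initial_observations: Sequence[Observation],
    is_success: Callable[[Observation], bool],
    max_chunks: int = 10,
) -> float:
    """Fraction of imagined rollouts that end in success."""
    outcomes = [
        closed_loop_rollout(policy, world_model, obs, is_success, max_chunks)[0]
        for obs in initial_observations
    ]
    return sum(outcomes) / len(outcomes)


def success_rate_error(real_rate: float, estimated_rate: float) -> float:
    """Assumed metric: absolute gap between real and estimated success rates."""
    return abs(real_rate - estimated_rate)


# Toy 1-D example: the "robot" must move its state to at least 1.0.
def toy_policy(obs: Observation) -> ActionChunk:
    return [0.25, 0.25] if obs[0] < 1.0 else [0.0, 0.0]


def toy_world_model(obs: Observation, chunk: ActionChunk) -> Observation:
    return (obs[0] + sum(chunk),)


def toy_is_success(obs: Observation) -> bool:
    return obs[0] >= 1.0
```

In this sketch, each action chunk depends on the observation produced by the previous chunk, which is the property that open-loop prediction along pre-collected trajectories lacks. Calling `estimate_success_rate` over a set of initial observations and comparing the result with a real-world success rate through `success_rate_error` mirrors, in simplified form, the comparison the abstract reports.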