🤖 AI Summary
This work addresses the fragmentation of research on world models for robotic manipulation, where the term now covers many kinds of predictive systems and the design choices that matter for manipulation have become unclear. The survey is organized around three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. It defines a world model operationally as an action-conditioned predictive system, groups existing work into five representation families, and proposes a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, positioning world models as predictive infrastructure for robot learning. The review covers latent dynamics models, action-conditioned video generators, 3D/4D scene predictors, physics-informed simulators, and predictive modules in vision-language-action systems. It also reviews 34 manipulation datasets, synthesizes evaluation protocols for predictive fidelity, task performance, and simulator reliability, and identifies open challenges in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.
📝 Abstract
Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.
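To make the operational definition concrete, the following minimal sketch contrasts the input-output signature of an action-conditioned world model with those of an inverse model, a policy, a reward, and a value function, and then uses the world model as an explicit predictive planner through search-based evaluation of candidate action sequences. Everything here is an illustrative assumption rather than a method from the survey: the 2-D state, the `toy_world_model` stand-in for a learned predictor, the squared-distance cost, and the random-shooting search in `plan_by_search` are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]    # hypothetical 2-D object position
Action = Tuple[float, float]   # hypothetical 2-D push displacement

# Signatures that separate a world model from neighbouring components.
WorldModel = Callable[[State, Action], State]     # (state, action) -> predicted next state
InverseModel = Callable[[State, State], Action]   # (state, next state) -> action
Policy = Callable[[State], Action]                # state -> action
Reward = Callable[[State, Action], float]         # (state, action) -> scalar
Value = Callable[[State], float]                  # state -> expected return


def toy_world_model(state: State, action: Action) -> State:
    """Stand-in for a learned predictor: the object moves by the push."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(model: WorldModel, state: State, actions: Sequence[Action]) -> State:
    """Predict the final state reached by applying the actions in order."""
    for action in actions:
        state = model(state, action)
    return state


def plan_by_search(
    model: WorldModel,
    state: State,
    goal: State,
    horizon: int = 3,
    n_candidates: int = 64,
    seed: int = 0,
) -> Tuple[List[Action], float]:
    """Random-shooting search: sample action sequences, score each one by the
    predicted final state, and keep the sequence with the lowest cost."""
    rng = random.Random(seed)

    def cost(final: State) -> float:
        # Squared distance between the predicted final state and the goal.
        return (final[0] - goal[0]) ** 2 + (final[1] - goal[1]) ** 2

    best_plan: List[Action] = []
    best_cost = float("inf")
    for _ in range(n_candidates):
        plan = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        candidate_cost = cost(rollout(model, state, plan))
        if candidate_cost < best_cost:
            best_plan, best_cost = plan, candidate_cost
    return best_plan, best_cost


if __name__ == "__main__":
    plan, plan_cost = plan_by_search(toy_world_model, (0.0, 0.0), (1.0, 1.0))
    print(plan, plan_cost)
```

In this sketch only the world model takes an action as input and returns a predicted future state, which is what allows the planner to evaluate candidate actions before execution. An integrated prediction-action model would instead produce the action and the prediction inside one network, with no separate search loop.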