World Models for Robotic Manipulation: A Survey

📅 2026-05-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This survey addresses the fragmentation of research on world models for robotic manipulation, where the breadth of the term has obscured the design choices that matter. It organizes the field around three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. It proposes a functional taxonomy that separates integrated prediction-action models from explicit predictive planners and positions world models as predictive infrastructure for robot learning. The review covers latent dynamics models, action-conditioned video generators, 3D/4D scene predictors, physics-informed simulators, and predictive modules in vision-language-action systems. It also reviews 34 manipulation datasets and synthesizes evaluation protocols for predictive fidelity, task performance, and simulator reliability, highlighting open challenges such as contact modeling and hallucination control.
📝 Abstract
Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.
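As a rough illustration of the abstract's operational definition (an action-conditioned predictive system) and of the "candidate filtering" role, the sketch below may help. It is not taken from the survey: the names `toy_world_model`, `rollout`, and `filter_candidates`, the 2D point state, and the additive dynamics are all hypothetical assumptions chosen only to keep the example small.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # assumed toy state: 2D position of an object
Action = Tuple[float, float]  # assumed toy action: 2D displacement command

def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical action-conditioned predictor: maps (state, action) to next state.

    A learned model would replace this hand-written additive dynamics.
    """
    return (state[0] + action[0], state[1] + action[1])

def rollout(model: Callable[[State, Action], State],
            state: State,
            actions: Sequence[Action]) -> State:
    """Predict the final state after applying a sequence of actions."""
    for action in actions:
        state = model(state, action)
    return state

def filter_candidates(model: Callable[[State, Action], State],
                      state: State,
                      goal: State,
                      candidates: List[List[Action]]) -> List[Action]:
    """Return the candidate plan whose predicted outcome is closest to the goal."""
    def cost(plan: List[Action]) -> float:
        x, y = rollout(model, state, plan)
        return (x - goal[0]) ** 2 + (y - goal[1]) ** 2
    return min(candidates, key=cost)

# Three hypothetical candidate plans, e.g. proposed by a separate policy.
candidates = [
    [(1.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (0.0, 1.0)],
    [(1.0, 0.0), (0.0, 1.0)],
]
best = filter_candidates(toy_world_model, (0.0, 0.0), (1.0, 1.0), candidates)
print(best)
```

Here the predictor only forecasts outcomes; choosing among plans is left to a separate scoring step, which mirrors the distinction the abstract draws between a world model and a policy, reward, or value function.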
Problem

Research questions and friction points this paper is trying to address.

world models
robotic manipulation
predictive modeling
action-conditioned prediction
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
robotic manipulation
action-conditioned prediction
predictive infrastructure
functional taxonomy
Fangyuan Wang
AMD
Ziyuan Wang
Department of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen, China
Guorui Pei
College of Robotics Science and Engineering, Taiyuan University of Technology, Taiyuan, China
Mengshi Zhang
School of Data Science, City University of Hong Kong (Dongguan), Dongguan, Guangdong, China
Canxi Liang
Department of Mechatronic Engineering, Guangdong Polytechnic Normal University, Guangdong, China
Jun Hu
School of Advanced Engineering, Great Bay University, Dongguan, Guangdong, China
Zhongxuan Li
School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China
Jinsong Wu
University of Chile, Chile
green technologies, data-driven sustainability, sustainable engineering, big data, Internet of things
Ning Han
Department of Mechanical Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
Zeqing Zhang
The University of Hong Kong
robotic manipulation, multi-agent system, collision detection
Jiaming Qi
Northeast Forestry University
Robot, Shape deformation, Model-free adaptive control
Hongmin Wu
Guangdong Academy of Sciences, China
Robot Skill Learning, Autonomous Manipulation, Human-Robot Collaboration
Shiyao Zhang
School of Advanced Engineering, Great Bay University, Dongguan, Guangdong, China
Pai Zheng
Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR
Jia Pan
Computer Science, The University of Hong Kong
Robotics
David Navarro-Alarcon
The Hong Kong Polytechnic University
Robotics
Sichao Liu
Department of Production Engineering, KTH Royal Institute of Technology, Stockholm, Sweden
Peng Zhou
School of Advanced Engineering, Great Bay University, Dongguan, Guangdong, China