🤖 AI Summary
To address the core challenges of low data efficiency and poor generalization in robotic manipulation, this paper proposes a world-model framework built on a pre-trained multimodal image-generation model, the first to use such a model for zero-shot robotic control. Methodologically, it integrates vision–language representation learning with open-ended future state prediction, coupled with a zero-shot low-level controller, enabling general-purpose manipulation without task-specific training or real-world fine-tuning. Key contributions include: (1) shifting generative world modeling from a purely discriminative learning paradigm toward embodied action guidance; and (2) enabling zero-shot transfer across diverse scenes and objects. Evaluated both in simulation and on real robotic platforms, the approach successfully executes heterogeneous manipulation tasks, including grasping, pushing, and insertion, demonstrating strong cross-task generalization and practical deployability.
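To make the described pipeline concrete, here is a minimal sketch of how an image-generation world model and a zero-shot low-level controller could be composed: the world model imagines a future (subgoal) image from the current observation and a language instruction, and the controller acts toward that subgoal. All class and function names below are hypothetical placeholders for illustration, not the paper's actual code or API.

```python
"""Sketch of the subgoal-imagination + zero-shot-control loop described above.
Names and signatures are illustrative assumptions, not the released implementation."""

import numpy as np


class ImageGenWorldModel:
    """Stand-in for a pre-trained multimodal image-generation model used as a
    world model: given the current image and an instruction, it imagines a
    plausible future scene (the subgoal)."""

    def predict_future(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real system would query the generative model here; this stub just
        # returns the input image unchanged.
        return image.copy()


class ZeroShotController:
    """Stand-in for a zero-shot low-level control module (e.g. an off-the-shelf
    grasp or motion planner) that moves the scene toward the predicted subgoal
    without task-specific training."""

    def act(self, current: np.ndarray, subgoal: np.ndarray) -> np.ndarray:
        # Return a dummy end-effector command; a real controller would compute
        # actions from the current observation and the imagined subgoal.
        return np.zeros(7)  # e.g. 6-DoF pose delta + gripper command


def manipulation_step(image, instruction, world_model, controller):
    """One step: imagine the next desired state, then act toward it."""
    subgoal = world_model.predict_future(image, instruction)
    action = controller.act(image, subgoal)
    return subgoal, action


if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.uint8)  # current camera frame
    wm, ctrl = ImageGenWorldModel(), ZeroShotController()
    subgoal, action = manipulation_step(obs, "push the red block to the left", wm, ctrl)
    print(subgoal.shape, action.shape)
```

In this sketch the key design choice is that neither component is trained on the target task: the world model supplies generalization through its pre-trained visual-semantic knowledge, while the controller only has to close the gap between the current image and the imagined subgoal.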
📝 Abstract
Improving data efficiency and generalization in robotic manipulation remains a core challenge. We propose a novel framework that leverages a pre-trained multimodal image-generation model as a world model to guide policy learning. By exploiting its rich visual-semantic representations and strong generalization across diverse scenes, the model generates open-ended future state predictions that inform downstream manipulation. Coupled with zero-shot low-level control modules, our approach enables general-purpose robotic manipulation without task-specific training. Experiments in both simulation and real-world environments demonstrate that our method achieves effective performance across a wide range of manipulation tasks with no additional data collection or fine-tuning. Supplementary materials are available on our website: https://world4omni.github.io/.