World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the core challenges of low data efficiency and poor generalization in robotic manipulation, this paper proposes a world-model framework built on a pre-trained multimodal image-generation model—the first to enable zero-shot robotic control. Methodologically, it combines rich vision–language representations with open-ended future-state prediction, coupled with a zero-shot low-level controller, enabling general-purpose manipulation without task-specific training or real-world fine-tuning. Key contributions include: (1) repurposing generative world models from passive prediction to embodied action guidance; and (2) enabling zero-shot transfer across diverse scenes and objects. Evaluated both in simulation and on real robotic platforms, the approach successfully executes heterogeneous manipulation tasks—including grasping, pushing, and insertion—demonstrating strong cross-task generalization and practical deployability.

📝 Abstract
Improving data efficiency and generalization in robotic manipulation remains a core challenge. We propose a novel framework that leverages a pre-trained multimodal image-generation model as a world model to guide policy learning. By exploiting its rich visual-semantic representations and strong generalization across diverse scenes, the model generates open-ended future state predictions that inform downstream manipulation. Coupled with zero-shot low-level control modules, our approach enables general-purpose robotic manipulation without task-specific training. Experiments in both simulation and real-world environments demonstrate that our method achieves effective performance across a wide range of manipulation tasks with no additional data collection or fine-tuning. Supplementary materials are available on our website: https://world4omni.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Improving data efficiency in robotic manipulation
Enhancing generalization across diverse manipulation tasks
Enabling zero-shot control without task-specific training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a pre-trained image-generation model as a world model
Integrates zero-shot low-level control modules
Requires no task-specific training or fine-tuning
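The pipeline described above—an image-generation world model imagining a future (goal) observation, which a zero-shot low-level controller then acts toward—can be sketched in pseudocode. All class names and interfaces below are illustrative assumptions, not the authors' actual API:

```python
import numpy as np

class ImageGenWorldModel:
    """Stand-in for a pre-trained multimodal image-generation model."""

    def predict_future(self, observation: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would synthesize an image of the scene after the
        # instructed task is completed; here we return a copy of the
        # current observation as a placeholder "goal image".
        return observation.copy()

class ZeroShotController:
    """Stand-in for a zero-shot low-level control module."""

    def act(self, observation: np.ndarray, goal_image: np.ndarray) -> np.ndarray:
        # A real controller would derive motions that move the scene
        # toward the imagined goal; here we return a zero action of a
        # fixed dimension (e.g. 6-DoF pose delta + gripper command).
        return np.zeros(7)

def manipulation_step(world_model, controller, observation, instruction):
    """One closed-loop step: imagine the future state, then act toward it."""
    goal_image = world_model.predict_future(observation, instruction)
    action = controller.act(observation, goal_image)
    return action, goal_image

obs = np.zeros((64, 64, 3))  # dummy RGB observation
action, goal = manipulation_step(
    ImageGenWorldModel(), ZeroShotController(), obs, "pick up the red block"
)
```

Because both modules are pre-trained or zero-shot, neither requires task-specific data—the world model supplies the goal, and the controller grounds it in actions.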
👥 Authors
Haonan Chen — School of Computing, National University of Singapore
Bangjun Wang — University of Hong Kong
Jingxiang Guo — National University of Singapore
Tianrui Zhang — Institute for Interdisciplinary Information Sciences, Tsinghua University
Yiwen Hou — National University of Singapore
Xuchuan Huang — Peking University
Chenrui Tie — National University of Singapore
Lin Shao — School of Computing, National University of Singapore; NUS Guangzhou Research Translation and Innovation Institute