ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

📅 2025-04-23

🤖 AI Summary
To address imprecise instruction following and low visual fidelity in robotic manipulation video synthesis, this paper proposes ManipDreamer, a world model that integrates action-semantic modeling with multimodal visual guidance. Methodologically, it introduces (1) an action tree structure that explicitly encodes the hierarchical dependencies among the actions specified in an instruction, and (2) a unified visual guidance adapter that jointly fuses depth and semantic features to enhance physical plausibility and spatiotemporal consistency. The framework integrates action-tree embeddings, the multimodal adapter, diffusion-based video generation, and a hierarchical instruction-action conditioning mechanism. On unseen tasks, compared with RoboDreamer, the method achieves a PSNR of 21.05 (+1.50), an SSIM of 0.7982 (+0.0508), and a flow error of 3.201 (−0.305). It also improves the average success rate on 6 RLBench tasks by 2.5%.
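The improvements quoted above can be re-derived from the baseline and new values reported in the abstract. The short sketch below does that arithmetic; the dictionary layout and the `metric_deltas` helper are illustrative names, not part of the paper, and the relative percentages are computed here rather than reported by the authors.

```python
# Unseen-task metrics quoted in the abstract: (RoboDreamer baseline, ManipDreamer).
reported = {
    "PSNR": (19.55, 21.05),        # higher is better
    "SSIM": (0.7474, 0.7982),      # higher is better
    "Flow Error": (3.506, 3.201),  # lower is better
}

def metric_deltas(pairs):
    """Return {name: (absolute change, relative change in percent)}."""
    out = {}
    for name, (baseline, new) in pairs.items():
        diff = new - baseline
        out[name] = (round(diff, 4), round(100.0 * diff / baseline, 2))
    return out

deltas = metric_deltas(reported)
for name, (abs_change, rel_change) in deltas.items():
    print(f"{name}: {abs_change:+.4f} ({rel_change:+.2f}%)")
```

The absolute changes match the figures in the summary: +1.5 for PSNR, +0.0508 for SSIM, and −0.305 for flow error.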

📝 Abstract
While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on the action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as an action tree and assign embeddings to the tree nodes; each instruction acquires its embedding by navigating through the action tree. The instruction embeddings are then used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics on both seen and unseen tasks: compared to the recent RoboDreamer model, PSNR improves from 19.55 to 21.05, SSIM improves from 0.7474 to 0.7982, and Flow Error drops from 3.506 to 3.201 on unseen tasks. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% on average across 6 RLBench tasks.
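One way to picture the action tree described in the abstract is as a trie over instruction primitives, where every node carries an embedding and an instruction's embedding is read off the path it traverses. The sketch below is only an illustration under stated assumptions: the class names, the example primitives ("pick", "red block"), the random stand-in embeddings, and the mean-pooling over the path are all hypothetical choices, not details confirmed by the paper.

```python
import random

class ActionNode:
    """One node of a hypothetical action tree, holding an embedding vector."""
    def __init__(self, name, dim, rng):
        self.name = name
        self.children = {}
        # Stand-in for a learned embedding: a fixed random vector.
        self.embedding = [rng.gauss(0.0, 1.0) for _ in range(dim)]

class ActionTree:
    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.root = ActionNode("<root>", dim, self.rng)

    def embed(self, primitives):
        """Walk the tree along `primitives`, creating nodes as needed,
        and return the mean of the embeddings on the visited path."""
        node = self.root
        path = [node]
        for p in primitives:
            if p not in node.children:
                node.children[p] = ActionNode(p, self.dim, self.rng)
            node = node.children[p]
            path.append(node)
        return [sum(n.embedding[i] for n in path) / len(path)
                for i in range(self.dim)]

tree = ActionTree()
a = tree.embed(["pick", "red block"])   # assumed decomposition of an instruction
b = tree.embed(["pick", "blue block"])  # shares the "pick" node with `a`
c = tree.embed(["pick", "red block"])   # same path as `a`
```

Because `a` and `b` pass through the same "pick" node, their embeddings share a component, which is the kind of relationship between primitives that fully separate primitive embeddings would not capture.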
Problem

Research questions and friction points this paper is trying to address.

Improving instruction-following in robotic manipulation world models
Enhancing visual quality with depth and semantic guidance
Modeling relationships between instruction primitives using action trees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action tree embeds relationships between instruction primitives
Visual guidance adapter enhances video consistency
Combines depth and semantic guidance for quality
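As a rough illustration of combining depth and semantic guidance through an adapter, the sketch below builds a per-pixel guidance vector from a depth value and a one-hot semantic label, projects it, and adds it to existing features as a residual. This is an assumption-laden toy in pure Python: the function names, the additive residual design, and the `weights` matrix are hypothetical and do not describe the architecture actually used in ManipDreamer.

```python
def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def fuse_guidance(features, depth, semantics, weights, num_classes, scale=1.0):
    """Add a projected depth+semantic guidance vector to each pixel's features.

    features:  H x W x C nested lists (stand-in for world-model features)
    depth:     H x W floats
    semantics: H x W integer class labels
    weights:   (1 + num_classes) x C projection matrix (hypothetical; would be learned)
    """
    fused = []
    for y, row in enumerate(features):
        fused_row = []
        for x, feat in enumerate(row):
            # Guidance vector: [depth, one-hot semantic class].
            guide = [depth[y][x]] + one_hot(semantics[y][x], num_classes)
            proj = [sum(g * weights[k][c] for k, g in enumerate(guide))
                    for c in range(len(feat))]
            fused_row.append([f + scale * p for f, p in zip(feat, proj)])
        fused.append(fused_row)
    return fused

# A 1 x 2 image with 2 feature channels and 2 semantic classes.
features = [[[1.0, 2.0], [3.0, 4.0]]]
depth = [[0.5, 2.0]]
semantics = [[0, 1]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_guidance(features, depth, semantics, weights, num_classes=2)

# With an all-zero projection the adapter leaves the features untouched.
zero_weights = [[0.0, 0.0] for _ in range(3)]
unchanged = fuse_guidance(features, depth, semantics, zero_weights, num_classes=2)
```

Starting from a zero projection is a common choice for adapters attached to a pretrained generator, since the base model's behavior is preserved until the adapter has learned something useful; whether ManipDreamer does this is not stated in the abstract.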

Authors

Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China
Xiaobao Wei
Institute of Software, Chinese Academy of Sciences
3D Vision
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal Generation, Robotics, Computer Vision
Yuming Li
Peking University
Zhongyu Zhao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China
Hao Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China
Ningning Ma
Unknown affiliation
Autonomous Driving, Cognitive Computing, Computer Vision
Ming Lu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China
Shanghang Zhang
Peking University
Embodied AI, Foundation Models