Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model

2025-12-08
AI Summary
Robotic manipulation planning suffers from weak physical grounding, susceptibility to hallucination, and poor physical consistency over long horizons. To address these problems, this paper proposes the Embodied Tree of Thoughts (EToT) framework. EToT uses a physics-based digital twin as its reasoning substrate and combines domain priors with a reflective branching mechanism: it runs a tree search inside the simulated environment to predict action outcomes and iteratively refine manipulation trajectories, while a vision-language model (VLM) diagnoses failure cases and generates corrective strategies. The work pioneers a deep coupling of embodied world models with Tree-of-Thought search and establishes a closed Real2Sim2Real transfer loop. Experiments show that EToT significantly outperforms existing baselines across diverse short- and long-horizon manipulation tasks, markedly improving physical dynamics prediction accuracy, failure recovery, and overall task success rate.

πŸ“ Abstract
World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, leading to hallucinations and a failure to maintain consistency in long-horizon physical constraints. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which utilizes VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Website: https://embodied-tree-of-thoughts.github.io
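The two branching mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`etot_search`, `priori_branch`, `reflective_branch`, `simulate`, the drawer task) is hypothetical, the "simulator" is a hand-written rule table standing in for the physics-based digital twin, the reflection step is a fixed rule standing in for a VLM, and the depth-first expansion order is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

State = FrozenSet[str]  # hypothetical: symbolic facts about the scene
Action = str


@dataclass
class Node:
    state: State
    plan: List[Action] = field(default_factory=list)


def etot_search(
    root: State,
    goal_reached: Callable[[State], bool],
    priori_branch: Callable[[State], Iterable[Action]],
    reflective_branch: Callable[[State, Action, str], Iterable[Action]],
    simulate: Callable[[State, Action], Tuple[State, Optional[str]]],
    max_expansions: int = 50,
) -> Optional[List[Action]]:
    """Depth-first tree search; returns a plan that succeeded in simulation, or None."""
    frontier = [Node(root)]
    expansions = 0
    while frontier and expansions < max_expansions:
        node = frontier.pop()
        if goal_reached(node.state):
            return node.plan
        expansions += 1
        candidates = list(priori_branch(node.state))  # priori branching
        tried = set()
        while candidates:
            action = candidates.pop(0)
            if action in tried:
                continue
            tried.add(action)
            next_state, failure = simulate(node.state, action)
            if failure is None:
                frontier.append(Node(next_state, node.plan + [action]))
            else:
                # Reflective branching: a diagnosed failure yields corrective actions.
                candidates.extend(reflective_branch(node.state, action, failure))
    return None


# Toy stand-ins (assumptions): put a block into a drawer that starts closed.
def simulate(state: State, action: Action) -> Tuple[State, Optional[str]]:
    if action == "grasp_block":
        return state | {"holding"}, None
    if action == "open_drawer":
        return state | {"drawer_open"}, None
    if action == "place_in_drawer":
        if "holding" not in state:
            return state, "nothing grasped"
        if "drawer_open" not in state:
            return state, "collision with closed drawer"
        return (state - {"holding"}) | {"block_in_drawer"}, None
    return state, "unknown action"


def priori_branch(state: State) -> List[Action]:
    return ["place_in_drawer"] if "holding" in state else ["grasp_block"]


def reflective_branch(state: State, action: Action, failure: str) -> List[Action]:
    return ["open_drawer"] if "closed drawer" in failure else []


def goal_reached(state: State) -> bool:
    return "block_in_drawer" in state


plan = etot_search(frozenset(), goal_reached, priori_branch, reflective_branch, simulate)
print(plan)
```

In this toy run the prior-based proposal alone fails at the closed drawer; the corrective `open_drawer` action only enters the tree because the simulated failure is diagnosed, which is the role the abstract assigns to Reflective Branching.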
Problem

Research questions and friction points this paper is trying to address.

Addresses the weak physical grounding of world models used for robot manipulation planning
Proposes a framework that combines physics simulation with VLM-based reasoning for planning
Targets long-horizon manipulation tasks while maintaining physical consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-based interactive digital twin used as an embodied world model
Tree search expanded through Priori Branching and Reflective Branching
Real2Sim2Real planning that keeps plans consistent with rigid-body dynamics and collision constraints
Authors

Wenjiang Xu
University of Chinese Academy of Sciences (UCAS), Tsinghua University, Nanjing University
Cindy Wang
Medical Student at Columbia VP&S
machine learning, biomedical informatics, psychiatry, neonatology
Rui Fang
Tsinghua University
Mingkang Zhang
Tsinghua University
Lusong Li
JD Explore Academy
Jing Xu
Tsinghua University
Jiayuan Gu
Assistant Professor, ShanghaiTech University
Embodied AI, 3D Vision
Zecui Zeng
JD Explore Academy
Rui Chen
Tsinghua University