AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world robot demonstration data is scarce, while simulated data offers limited fidelity and diversity with pronounced sim-to-real gaps, which hinders imitation learning. To address this, we propose an embodiment-aware video diffusion framework that uses robot kinematic renderings as a hard geometric anchor, guiding a pretrained video diffusion model to synthesize high-fidelity, diverse manipulation videos that strictly adhere to the commanded motion. The approach integrates spatiotemporal conditional control with cross-modal anchored distillation, eliminating action hallucination without requiring explicit environment modeling. Critically, it enables large-scale embodied video dataset construction from only a small number of teleoperated demonstrations. Experiments show that downstream policies trained on the generated data achieve a 36.4% relative performance gain in simulation and nearly double the task success rate on real robots.
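
As a rough illustration (not the authors' released code), the anchoring idea can be sketched as a video denoiser conditioned on rendered robot-kinematics frames. The channel-wise concatenation, layer sizes, tensor shapes, and noise-schedule value below are assumptions made for the sketch; the paper only states that motion renderings anchor the embodiment during diffusion.

```python
# Minimal sketch, assuming channel-wise concatenation of a robot-kinematics
# rendering with the noisy video latent. Not the paper's architecture.
import torch
import torch.nn as nn


class AnchoredDenoiser(nn.Module):
    """Toy denoiser: predicts noise from a noisy video latent plus a
    rendered robot-motion sequence used as a hard geometric anchor."""

    def __init__(self, latent_ch: int = 4, anchor_ch: int = 3, hidden: int = 64):
        super().__init__()
        # 3D convolutions over (time, height, width); input channels are
        # the video latent plus the anchor rendering.
        self.net = nn.Sequential(
            nn.Conv3d(latent_ch + anchor_ch, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, noisy_latent, anchor_frames, t):
        # noisy_latent: (B, C_lat, T, H, W); anchor_frames: (B, C_anc, T, H, W).
        # The timestep t is unused in this toy sketch; a real model would embed it.
        x = torch.cat([noisy_latent, anchor_frames], dim=1)
        return self.net(x)


# One training step under a standard epsilon-prediction diffusion loss.
model = AnchoredDenoiser()
latent = torch.randn(2, 4, 8, 32, 32)   # clean video latent (placeholder shape)
anchor = torch.rand(2, 3, 8, 32, 32)    # rendered robot-arm frames
noise = torch.randn_like(latent)
t = torch.randint(0, 1000, (2,))
alpha = 0.7                              # placeholder noise-schedule value
noisy = alpha**0.5 * latent + (1 - alpha)**0.5 * noise
loss = nn.functional.mse_loss(model(noisy, anchor, t), noise)
loss.backward()
```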

📝 Abstract
The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.
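
The scaling recipe described in the abstract can be pictured as a simple loop: perturb a handful of teleoperated trajectories, render the robot kinematics alone, and let the anchored diffusion model fill in objects and scenes consistent with that motion. The helper names (`perturb_trajectory`, `render_kinematics`, `generate_video`) and the 100x expansion factor below are hypothetical placeholders, not the paper's pipeline.

```python
# Hypothetical sketch of the dataset-scaling loop; all functions are stubs.
import random


def perturb_trajectory(actions):
    # Placeholder: jitter joint targets to diversify motions.
    return [a + random.gauss(0.0, 0.01) for a in actions]


def render_kinematics(actions):
    # Placeholder: forward-kinematics rendering of the robot alone;
    # in the paper this rendering provides the hard geometric anchor.
    return [f"robot_frame({a:.3f})" for a in actions]


def generate_video(anchor_frames):
    # Placeholder: the anchored video diffusion model synthesizes objects
    # and background that stay consistent with the rendered robot.
    return [f"rgb<{frame}>" for frame in anchor_frames]


seed_demos = [[0.0, 0.1, 0.2, 0.3]]      # a handful of teleoperated demos
synthetic_dataset = []
for demo in seed_demos:
    for _ in range(100):                  # scale each demo ~100x (assumed factor)
        actions = perturb_trajectory(demo)
        anchors = render_kinematics(actions)
        video = generate_video(anchors)
        synthetic_dataset.append({"video": video, "actions": actions})

print(len(synthetic_dataset))             # 100 synthetic video/action pairs
```
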
Problem

Research questions and friction points this paper is trying to address.

How to synthesize diverse robot data from only a few demonstrations
How to avoid the embodiment inconsistencies of existing generative methods
How to improve imitation learning performance on real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes video diffusion for robot data synthesis
Anchors the robot embodiment to prevent motion hallucination
Scales a handful of demonstrations into large, diverse datasets
Junjie Ye
Toyota Research Institute, USC Physical Superintelligence (PSI) Lab
Rong Xue
USC Physical Superintelligence (PSI) Lab
Basile Van Hoorick
Toyota Research Institute
Pavel Tokmakov
ML Research Scientist, TRI
Computer Vision · Machine Learning
Muhammad Zubair Irshad
Toyota Research Institute | Georgia Institute of Technology
3D Vision · Robot Learning · Foundation Models · Generative AI · Multimodal AI
Yue Wang
USC Physical Superintelligence (PSI) Lab
Vitor Guizilini
Toyota Research Institute