Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes Dream.exe, an evaluation framework that tests whether the outputs of video generation models are executable by running the depicted motion in a physics simulator. To ask whether generated manipulation videos obey physical laws and can be realized by robots, the method chains a video generation model, trajectory extraction, and a physics simulator into an end-to-end video-to-execution pipeline. Evaluating 8 models (frontier closed-source generators, open-source generators, and robot-specific models) on 101 manually curated manipulation tasks shows that several models achieve measurable execution success, suggesting that priors learned from internet-scale data already encode meaningful physical knowledge. The study also finds that visual quality is a poor predictor of executability, motivating evaluation dimensions beyond purely visual metrics.

📝 Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
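
The video-to-execution loop described in the abstract can be pictured with a minimal Python sketch. This is not the Dream.exe code or its API: `Task`, `evaluate`, and the three callables (`generate_video`, `extract_trajectory`, `execute_in_sim`) are hypothetical names introduced only for illustration, and reporting success per complexity level is an assumption about how results might be aggregated.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Task:
    """One benchmark item (hypothetical structure, not the released format)."""
    scene_image: str   # path or identifier of the scene image
    description: str   # natural-language task description
    level: int         # physical-complexity level, assumed to be 1, 2, or 3


def evaluate(
    tasks: List[Task],
    generate_video: Callable[[str, str], Any],   # (scene image, description) -> video
    extract_trajectory: Callable[[Any], Any],    # video -> robot trajectory
    execute_in_sim: Callable[[Task, Any], bool], # (task, trajectory) -> success flag
) -> Dict[int, float]:
    """Run generate -> extract -> execute for each task.

    Returns the execution success rate per complexity level.
    """
    outcomes: Dict[int, List[bool]] = {}
    for task in tasks:
        # 1. Synthesize a manipulation video from the scene and instruction.
        video = generate_video(task.scene_image, task.description)
        # 2. Convert the depicted motion into a robot trajectory.
        trajectory = extract_trajectory(video)
        # 3. Execute in the simulator and record whether the task succeeded.
        success = bool(execute_in_sim(task, trajectory))
        outcomes.setdefault(task.level, []).append(success)
    # Fraction of successful executions at each level, in level order.
    return {lvl: sum(r) / len(r) for lvl, r in sorted(outcomes.items())}
```

Passing the three stages in as callables keeps the sketch independent of any particular generator, trajectory extractor, or simulator, so the same loop could be rerun for each model under evaluation.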

Problem

Research questions and friction points this paper is trying to address.

video generation

robot manipulation

physical grounding

executability

generative models

Innovation

Methods, ideas, or system contributions that make the work stand out.

video-to-execution

robotic manipulation

physical grounding