🤖 AI Summary
This work addresses the challenge of teaching robots complex manipulation skills—such as pouring, wiping, and mixing—using only AI-generated videos, without physical demonstrations or robot-specific training. Methodologically, it leverages text-to-video diffusion models to synthesize action videos, employs vision-language models to automatically filter out semantically inconsistent generations, extracts object motion trajectories via 6D pose estimation, and maps these trajectories to robot execution in an embodiment-agnostic manner. Its key contribution is the first demonstration that purely synthetic video can serve as supervision for end-to-end, closed-loop transfer from generated visual data to real-world robot control. Experiments show that the approach performs on par with human demonstrations, improves consistently as video generation quality increases, and significantly outperforms baseline methods—including VLM-based keypoint prediction and dense feature tracking—in both accuracy and generalization.
📝 Abstract
This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks—such as pouring, wiping, and mixing—purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.
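The abstract describes a four-stage pipeline: generate candidate videos, filter them with a VLM, extract object trajectories with 6D pose tracking, and retarget the trajectories to the robot. The control flow can be sketched as below; every function body is an illustrative stub (the names, signatures, and stand-in logic are assumptions, not the authors' implementation or API).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-ins for RIGVid's components. Each stage is stubbed so
# the skeleton is runnable; real systems would call a video diffusion model,
# a VLM, and a 6D pose tracker here.

@dataclass
class Pose6D:
    position: tuple      # (x, y, z) object translation
    orientation: tuple   # unit quaternion (w, x, y, z)

def generate_videos(command: str, scene_image, n: int = 3) -> List[str]:
    # Stand-in for a text-to-video diffusion model conditioned on the
    # command and initial scene image; returns handles to sampled videos.
    return [f"video_{i}" for i in range(n)]

def vlm_accepts(command: str, video: str) -> bool:
    # Stand-in for the VLM filter that rejects generations which do not
    # follow the language command (here, pretend one sample is off-command).
    return video != "video_1"

def track_object_poses(video: str) -> List[Pose6D]:
    # Stand-in for per-frame 6D pose tracking of the manipulated object.
    return [Pose6D((0.0, 0.0, 0.1 * t), (1.0, 0.0, 0.0, 0.0)) for t in range(5)]

def retarget_to_robot(trajectory: List[Pose6D]) -> List[Pose6D]:
    # Embodiment-agnostic retargeting: command the end-effector so the
    # grasped object reproduces the tracked motion (identity placeholder).
    return trajectory

def rigvid_pipeline(command: str, scene_image) -> Optional[List[Pose6D]]:
    for video in generate_videos(command, scene_image):
        if vlm_accepts(command, video):
            return retarget_to_robot(track_object_poses(video))
    return None  # no generated video followed the command

waypoints = rigvid_pipeline("pour water into the cup", scene_image=None)
```

Because the retargeting step operates on object trajectories rather than robot joint states, the same pipeline structure applies regardless of the robot embodiment, which is the property the abstract calls "embodiment-agnostic".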