🤖 AI Summary
This work addresses the embodiment gap between in-the-wild first-person human videos and a target robot in zero-shot imitation learning. We propose EmbodiSwap, a method for producing photorealistic, temporally coherent synthetic robot overlays over human demonstration videos, given only the videos and a robot URDF model. By bypassing real-world robot data collection, EmbodiSwap enables end-to-end policy training under zero-shot conditions. We repurpose the self-supervised video representation model V-JEPA from video understanding to serve as the policy's visual backbone, where it outperforms vision backbones more conventionally used in robotics. Evaluated on physical robots, our method achieves an 82% task success rate, substantially outperforming a few-shot-trained π₀ baseline as well as π₀ trained on EmbodiSwap-synthesized data. To foster reproducibility and further research, we release our code, synthetic dataset, and pre-trained weights.
📝 Abstract
We introduce EmbodiSwap, a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild egocentric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing it from the domain of video understanding to imitation learning over synthetic robot videos. V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an $82\%$ success rate, outperforming a few-shot trained $\pi_0$ network as well as $\pi_0$ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays, which takes as input human videos and an arbitrary robot URDF and produces a robot dataset; (ii) the robot datasets we synthesize over EPIC-Kitchens, HOI4D, and Ego4D; and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.