EmbodiSwap for Zero-Shot Robot Imitation Learning

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the embodiment gap between in-the-wild first-person human videos and a target robot in zero-shot imitation learning. We propose EmbodiSwap, a method that produces photorealistic synthetic robot overlays on human demonstration videos, taking as input ego-centric video and an arbitrary robot URDF and outputting a robot-embodied training dataset. A closed-loop manipulation policy is trained on this synthetic data, with V-JEPA repurposed from video understanding as the visual backbone; this choice outperforms vision backbones more conventionally used in robotics. By bypassing real-world robot data collection, EmbodiSwap enables zero-shot policy training. In real-world tests, our method achieves an 82% task success rate, outperforming both a few-shot trained π₀ policy and π₀ trained on EmbodiSwap-synthesized data. To foster reproducibility and further research, we release our code, synthetic dataset, and pre-trained weights.

📝 Abstract
We introduce EmbodiSwap, a method for producing photorealistic synthetic robot overlays over human video. We employ EmbodiSwap for zero-shot imitation learning, bridging the embodiment gap between in-the-wild ego-centric human video and a target robot embodiment. We train a closed-loop robot manipulation policy over the data produced by EmbodiSwap. We make novel use of V-JEPA as a visual backbone, repurposing V-JEPA from the domain of video understanding to imitation learning over synthetic robot videos. Adoption of V-JEPA outperforms alternative vision backbones more conventionally used within robotics. In real-world tests, our zero-shot trained V-JEPA model achieves an 82% success rate, outperforming a few-shot trained π₀ network as well as π₀ trained over data produced by EmbodiSwap. We release (i) code for generating the synthetic robot overlays, which takes as input human videos and an arbitrary robot URDF and generates a robot dataset, (ii) the robot dataset we synthesize over EPIC-Kitchens, HOI4D and Ego4D, and (iii) model checkpoints and inference code, to facilitate reproducible research and broader adoption.
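The abstract describes a pipeline that turns in-the-wild human video plus an arbitrary robot URDF into a robot-embodied dataset with action labels. The released code's actual interface is not reproduced here; the following is a minimal, hypothetical sketch of what such an overlay pipeline could look like, with the hand-pose estimation, retargeting, and rendering steps left as stubs (all function names are assumptions, not the paper's API).

```python
# Hypothetical sketch of an EmbodiSwap-style overlay pipeline.
# All function names below are illustrative stubs, NOT the paper's released API.
import numpy as np

def estimate_hand_pose(rgb: np.ndarray) -> np.ndarray:
    """Stub: a real pipeline would run a hand/wrist pose estimator on the frame."""
    return np.zeros(6)                     # placeholder 6-DoF wrist pose

def retarget_to_robot(wrist_pose: np.ndarray, urdf_path: str) -> np.ndarray:
    """Stub: map the human wrist pose to robot joint angles, e.g. via IK on the URDF."""
    return np.zeros(7)                     # e.g. a 7-DoF arm configuration

def render_overlay(rgb: np.ndarray, joints: np.ndarray, urdf_path: str) -> np.ndarray:
    """Stub: render the posed robot model and composite it over the human frame."""
    return rgb.copy()

def embodiswap_clip(frames: list[np.ndarray], urdf_path: str):
    """Turn one human demonstration clip into a robot-embodied clip with action labels."""
    robot_frames, actions = [], []
    for rgb in frames:
        wrist = estimate_hand_pose(rgb)
        joints = retarget_to_robot(wrist, urdf_path)
        robot_frames.append(render_overlay(rgb, joints, urdf_path))
        actions.append(joints)             # retargeted joints double as imitation labels
    return np.stack(robot_frames), np.stack(actions)

if __name__ == "__main__":
    clip = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
    video, actions = embodiswap_clip(clip, urdf_path="robot.urdf")
    print(video.shape, actions.shape)      # (16, 224, 224, 3) (16, 7)
```

The key design point suggested by the abstract is that no real robot data is collected: the overlaid frames stand in for robot observations, and the retargeted poses stand in for action supervision.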
Problem

Research questions and friction points this paper is trying to address.

Bridging the embodiment gap between in-the-wild human videos and robot imitation learning
Generating photorealistic synthetic robot overlays for training
Adapting V-JEPA from video understanding to robot imitation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates photorealistic synthetic robot overlays on human videos
Uses a V-JEPA visual backbone repurposed for imitation learning
Trains a closed-loop robot policy on the synthetic data (a sketch of this structure follows below)
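To make the backbone-plus-policy idea concrete, the sketch below shows one plausible structure: a frozen video encoder feeding a small action head conditioned on proprioception. The encoder is a generic stand-in for V-JEPA (not its actual weights or tubelet embedding), and the proprioception input and action-chunk size are assumptions, not details taken from the paper.

```python
# Minimal sketch: a V-JEPA-style video backbone feeding a closed-loop policy head.
# The encoder is a generic stand-in (NOT actual V-JEPA weights); it only
# illustrates the frozen-backbone -> action-head structure described above.
import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    """Stand-in for a frozen V-JEPA encoder: tubelet embedding + transformer."""
    def __init__(self, dim=384, depth=4, patch=16):
        super().__init__()
        self.embed = nn.Conv3d(3, dim, kernel_size=(2, patch, patch),
                               stride=(2, patch, patch))         # tubelet tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.dim = dim

    def forward(self, video):                  # video: (B, 3, T, H, W)
        tokens = self.embed(video).flatten(2).transpose(1, 2)    # (B, N, dim)
        return self.encoder(tokens).mean(dim=1)                  # pooled clip feature

class Policy(nn.Module):
    """Closed-loop policy: clip feature + proprioception -> next action chunk."""
    def __init__(self, backbone, proprio_dim=7, action_dim=7, chunk=8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the visual backbone frozen
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(backbone.dim + proprio_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim * chunk))
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, video, proprio):
        feat = self.backbone(video)
        out = self.head(torch.cat([feat, proprio], dim=-1))
        return out.view(-1, self.chunk, self.action_dim)         # (B, chunk, action_dim)

if __name__ == "__main__":
    policy = Policy(VideoBackbone())
    video = torch.randn(2, 3, 8, 224, 224)     # synthetic robot-overlay clips
    proprio = torch.randn(2, 7)                # current joint state
    print(policy(video, proprio).shape)        # torch.Size([2, 8, 7])
```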
Eadom Dessalene
Department of Computer Science, University of Maryland, College Park, MD, 20742
Pavan Mantripragada
University of Maryland, College Park
Robotics, Manipulation, Grasping, Tactile Sensing
Michael Maynord
Department of Computer Science, University of Maryland, College Park, MD, 20742
Yiannis Aloimonos
Department of Computer Science, University of Maryland, College Park, MD, 20742