Masquerade: Learning from In-the-wild Human Videos using Data-Editing

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robot manipulation is hindered by the scarcity and limited diversity of real-robot demonstration data. To address this, the paper proposes Masquerade, which transforms in-the-wild egocentric human videos into embodiment-consistent robot demonstrations via 3-D hand pose estimation, human-arm inpainting, and overlay of a rendered bimanual robot that tracks the recovered end-effector trajectories, thereby bridging the visual embodiment gap between humans and robots. The method pre-trains a vision encoder on the edited clips with an auxiliary future 2-D robot keypoint prediction objective, and retains that objective while fine-tuning a diffusion policy head on only 50 real-robot demonstrations per task. Evaluated on three bimanual kitchen tasks in unseen scenes, the approach outperforms baselines by 5-6x, providing empirical evidence that large-scale human video can be transferred efficiently to bimanual robotic manipulation policy learning.

📝 Abstract
Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.
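The three editing steps (i)-(iii) from the abstract can be sketched as a per-frame transformation. This is a hypothetical illustration of the data flow only: the function names, the dictionary-based frame representation, and the stubbed estimators are stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of the Masquerade editing pipeline (steps i-iii).
# All names and the frame representation are illustrative assumptions.

def estimate_hand_poses(frame):
    """Step (i): recover hand poses from an egocentric frame (stubbed).

    A real system would run a 3-D hand pose estimator on the image."""
    return {"left": frame["left_hand_px"], "right": frame["right_hand_px"]}

def inpaint_arms(frame):
    """Step (ii): remove the human arms from the image (stubbed)."""
    return {**frame, "arms_removed": True}

def overlay_robot(frame, hand_poses):
    """Step (iii): composite a rendered bimanual robot whose end effectors
    track the recovered hand trajectories over the inpainted image."""
    return {**frame, "robot_overlay": hand_poses}

def robotize(video):
    """Turn one human egocentric clip into a robotized demonstration."""
    edited = []
    for frame in video:
        poses = estimate_hand_poses(frame)   # pose first, from the raw frame
        frame = inpaint_arms(frame)          # then erase the human arms
        edited.append(overlay_robot(frame, poses))
    return edited

demo = robotize([{"left_hand_px": (12, 40), "right_hand_px": (90, 38)}])
```

Estimating poses before inpainting matters: the hand pixels being erased in step (ii) are the evidence step (i) needs.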
Problem

Research questions and friction points this paper is trying to address.

Bridging the visual embodiment gap between humans and robots
Converting in-the-wild human videos into robotized demonstrations
Improving robot policies when real-robot data is limited
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-editing pipeline (hand pose estimation, arm inpainting, robot overlay) that closes the embodiment gap
Rendered bimanual robot overlay tracking recovered end-effector trajectories
Encoder pre-training with an auxiliary keypoint loss retained during policy fine-tuning; performance scales logarithmically with edited video
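The co-training idea above, keeping the auxiliary future-keypoint loss alongside the diffusion-policy objective during fine-tuning, amounts to a weighted sum of two losses. A minimal numeric sketch follows; the `aux_weight` balance term and the plain MSE form of the keypoint loss are assumptions, since the summary does not specify them.

```python
# Hedged sketch of the co-training objective: the policy loss on the 50 real
# robot demos is combined with the auxiliary future 2-D keypoint prediction
# loss kept from pre-training. `aux_weight` is an assumed hyperparameter.

def mse(pred, target):
    """Mean squared error between two equal-length sequences of scalars."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def co_training_loss(policy_loss, pred_keypoints, true_keypoints, aux_weight=0.1):
    """Total loss = diffusion-policy loss + weighted keypoint-prediction loss."""
    aux_loss = mse(pred_keypoints, true_keypoints)
    return policy_loss + aux_weight * aux_loss
```

The ablations reported in the abstract suggest why this term is kept: dropping the auxiliary objective (equivalently, `aux_weight = 0`) degrades generalization.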