What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Transferring manipulation policies to robots from everyday human videos, rather than from curated demonstrations, remains challenging because of the embodiment gap between humans and robots. This work introduces a new dataset of 532 human videos with 28 hours of triangulated hand labels and natural motions, and systematically investigates how hand pose quality and the human-robot motion gap affect policy transfer. The analysis finds that hand pose quality matters, but that even with accurate hand poses the inherent difference between human and robot motion hinders transfer. To address this, the authors propose a cotraining recipe in which the vision and policy networks specialize to each embodiment while still being trained jointly. Across six manipulation tasks in the low-robot-data regime, the recipe yields an absolute success rate gain of 29.7%.

📝 Abstract

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of $29.7\%$ in the low-robot-data regime across six manipulation tasks.
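An "absolute" success rate gain is a difference in percentage points, not a ratio relative to the baseline. The following minimal Python sketch illustrates the distinction. The function names and the two success rates are hypothetical placeholders chosen for illustration; they are not results or code from the paper.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between two success rates, in percentage points."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline success rate."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical success rates in percent, NOT taken from the paper.
baseline_rate = 40.0
cotrained_rate = 60.0

print(absolute_gain(baseline_rate, cotrained_rate))  # percentage points
print(relative_gain(baseline_rate, cotrained_rate))  # percent of baseline
```

With these placeholder numbers the absolute gain is 20 percentage points while the relative gain is 50%, which is why the two measures should not be conflated when comparing reported results.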

Problem

Research questions and friction points this paper is trying to address.

cotraining

robot manipulation

human videos

transfer learning

embodiment gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

cotraining

robot manipulation

human videos