AI Summary
This work addresses the scarcity of large-scale robot data, which limits robot manipulation capabilities, by transferring knowledge from human demonstrations to humanoid robots. Specifically, it fine-tunes the vision-language-action (VLA) model $\pi_{0.5}$ using only readily available, large-scale egocentric (first-person) human videos, enabling cross-embodiment (human-to-humanoid) task transfer and skill composition without any robot demonstration data for the new tasks. The study presents the first demonstration that a robot with five-fingered dexterous hands can learn novel task semantics and reuse existing skills from human video data alone. Experimental results show that the proposed method significantly improves the robot's generalization and compositional manipulation abilities when no corresponding robot data is available.
Abstract
Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $\pi_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/