Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of large-scale robot data, which limits robot manipulation capabilities, by transferring knowledge from human demonstrations to humanoid robots. It fine-tunes the vision-language-action (VLA) model π₀.₅ on egocentric (first-person) human data, which is easier to collect at scale than robot data, and investigates key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands. The results indicate that human data lets the robot learn new task semantics and compose existing skills into novel behaviors without corresponding robot data for those tasks.

📝 Abstract

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. To this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $\pi_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. Project website: https://egopipaper.github.io/

Problem

Research questions and friction points this paper is trying to address.

data scarcity

robotic manipulation

egocentric human data

embodiment

task semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric data

VLA fine-tuning

cross-embodiment learning