Developing Vision-Language-Action Model from Egocentric Videos

πŸ“… 2025-09-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language-action (VLA) models for egocentric manipulation rely heavily on expert teleoperation or fine-grained annotations (e.g., hand poses), which hinders scalable, low-cost pretraining. Method: We leverage EgoScaler, a framework that automatically extracts 6-DoF object manipulation trajectories from unlabeled first-person videos, and apply noise-robust filtering to construct a large-scale, high-quality VLA pretraining dataset. Using the Ο€β‚€ architecture, we perform end-to-end policy learning and evaluation in both simulated and real-robot settings. Contribution/Results: Pretraining solely on EgoScaler-generated data improves task success rates by over 20% compared to training from scratch and yields performance competitive with real-robot-collected data; joint training on both datasets further improves generalization. EgoScaler eliminates the need for manual annotation or teleoperation, establishing a scalable, cost-effective paradigm for VLA pretraining.
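
The summary describes a pipeline that turns raw first-person clips into 6-DoF object manipulation trajectories and then filters them before pretraining. The paper does not spell out a data schema, so the record below is only an illustrative sketch of what such a trajectory could look like; the field names and the translation-plus-quaternion pose convention are assumptions, not the authors' format.

```python
# Illustrative 6-DoF object-manipulation trajectory record; the schema,
# field names, and pose convention (translation + unit quaternion) are
# assumptions for the sake of the example, not EgoScaler's actual format.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class ObjectPose:
    """One 6-DoF pose: translation in metres, rotation as a unit quaternion (w, x, y, z)."""
    translation: np.ndarray  # shape (3,)
    quaternion: np.ndarray   # shape (4,), assumed normalised


@dataclass
class ManipulationTrajectory:
    """One manipulation clip: a language description plus a sequence of object poses."""
    instruction: str
    poses: List[ObjectPose] = field(default_factory=list)

    def max_step_displacement(self) -> float:
        """Largest frame-to-frame translation jump, a simple signal of tracking noise."""
        if len(self.poses) < 2:
            return 0.0
        deltas = [
            np.linalg.norm(b.translation - a.translation)
            for a, b in zip(self.poses, self.poses[1:])
        ]
        return float(max(deltas))
```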

πŸ“ Abstract
Egocentric videos capture how humans manipulate objects and tools, providing diverse motion cues for learning object manipulation. Unlike the costly, expert-driven manual teleoperation commonly used in training Vision-Language-Action models (VLAs), egocentric videos offer a scalable alternative. However, prior studies that leverage such videos for training robot policies typically rely on auxiliary annotations, such as detailed hand-pose recordings. Consequently, it remains unclear whether VLAs can be trained directly from raw egocentric videos. In this work, we address this challenge by leveraging EgoScaler, a framework that extracts 6DoF object manipulation trajectories from egocentric videos without requiring auxiliary recordings. We apply EgoScaler to four large-scale egocentric video datasets and automatically refine noisy or incomplete trajectories, thereby constructing a new large-scale dataset for VLA pre-training. Our experiments with a state-of-the-art $Ο€_0$ architecture in both simulated and real-robot environments yield three key findings: (i) pre-training on our dataset improves task success rates by over 20% compared to training from scratch, (ii) the performance is competitive with that achieved using real-robot datasets, and (iii) combining our dataset with real-robot data yields further improvements. These results demonstrate that egocentric videos constitute a promising and scalable resource for advancing VLA research.
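
The abstract reports pre-training a Ο€β‚€-style policy on the extracted trajectories in simulation and on a real robot. It does not detail how a trajectory becomes a training example, so the sketch below shows one generic way to form (observation, instruction, action-chunk) tuples; the chunk length and the use of translation deltas as action targets are assumptions, not the authors' recipe.

```python
# One hypothetical way to turn an extracted trajectory into VLA training tuples.
# The chunk length and the choice of translation deltas as actions are assumptions.
import numpy as np


def to_training_samples(frames, translations, instruction, chunk_len=8):
    """Pair each frame with the next `chunk_len` relative translations as the action target.

    frames:        list of H x W x 3 uint8 egocentric images
    translations:  list of (3,) object translations, one per frame
    """
    samples = []
    for t in range(len(frames) - chunk_len):
        # Action chunk: future frame-to-frame translation deltas of the manipulated object.
        chunk = np.stack(
            [translations[t + k + 1] - translations[t + k] for k in range(chunk_len)]
        )
        samples.append({"image": frames[t], "instruction": instruction, "actions": chunk})
    return samples


# Tiny usage example with synthetic data.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(20)]
translations = [np.array([0.01 * t, 0.0, 0.0]) for t in range(20)]
batch = to_training_samples(frames, translations, "put the cup on the shelf")
print(len(batch), batch[0]["actions"].shape)  # 12 (8, 3)
```
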
Problem

Research questions and friction points this paper is trying to address.

Training Vision-Language-Action models directly from raw egocentric videos
Eliminating reliance on costly manual teleoperation and auxiliary annotations
Automatically extracting object manipulation trajectories from videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts 6DoF object trajectories from egocentric videos
Automatically refines noisy or incomplete trajectories to build the dataset (see the illustrative filtering sketch after this list)
Uses egocentric videos for scalable VLA pre-training
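
The refinement step above is described only as dropping noisy or incomplete trajectories. The heuristic below is a minimal sketch of what such an automatic filter could check; the thresholds and criteria are illustrative assumptions, not the paper's actual rules.

```python
# Minimal, assumption-laden sketch of an automatic trajectory filter;
# thresholds and rejection criteria are illustrative, not EgoScaler's.
import numpy as np


def keep_trajectory(translations, min_len=16, max_jump_m=0.15):
    """Return True if a trajectory looks usable for pre-training.

    translations: (T, 3) array of per-frame object translations in metres.
    Rejects clips that are too short, contain missing (NaN) poses,
    or show implausibly large frame-to-frame jumps.
    """
    translations = np.asarray(translations, dtype=float)
    if len(translations) < min_len:
        return False  # too short to describe a full manipulation
    if np.isnan(translations).any():
        return False  # incomplete tracking
    jumps = np.linalg.norm(np.diff(translations, axis=0), axis=1)
    return bool(jumps.max() <= max_jump_m)  # reject obvious tracking glitches
```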