Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost and limited scalability of collecting real robot data for training vision-language-action (VLA) models in mobile robotics. The authors propose a framework that converts human egocentric walking videos into imitation learning data for mobile robots. The method estimates camera motion from the videos and translates it into action representations compatible with ground mobile robots, so that the human-derived data can be used for joint training with robot-collected data despite the viewpoint changes that occur during locomotion. On a fruit-search navigation task, the jointly trained model shows improved language understanding and more robust action generation than models trained on either data source alone, suggesting that human egocentric videos are a scalable, low-cost data source for robot learning.
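The translation from camera motion to robot actions can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the estimated camera trajectory has already been projected onto the ground plane as `(x, y, yaw)` poses sampled at a fixed interval `dt`, and that the target robot accepts a forward speed and a turn rate. The names `poses_to_actions` and `wrap_angle` are hypothetical.

```python
import math
from typing import List, Tuple

# Assumed pose format: x [m], y [m], yaw [rad] in a ground-plane world frame.
Pose2D = Tuple[float, float, float]


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def poses_to_actions(poses: List[Pose2D], dt: float) -> List[Tuple[float, float]]:
    """Turn consecutive planar poses into (forward speed, turn rate) pairs.

    Sideways motion is discarded, which is an assumption suited to robots
    that cannot move laterally.
    """
    if dt <= 0.0:
        raise ValueError("dt must be positive")
    actions = []
    for (x0, y0, yaw0), (x1, y1, yaw1) in zip(poses, poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Project the world-frame displacement onto the heading at the step start.
        forward = dx * math.cos(yaw0) + dy * math.sin(yaw0)
        linear = forward / dt
        angular = wrap_angle(yaw1 - yaw0) / dt
        actions.append((linear, angular))
    return actions
```

Calling `poses_to_actions(poses, dt)` on a trajectory of N poses yields N - 1 action pairs, one per frame transition. A real pipeline would also need to resolve the metric scale of the estimated motion, which this sketch takes as given.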

📝 Abstract

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
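One common way to train jointly on two data sources is to draw each batch from both at a fixed ratio. The sketch below shows that idea only. The abstract does not state how the human-derived and robot-collected datasets are mixed, so the function `sample_mixed_batch` and its `human_fraction` parameter are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_mixed_batch(
    human: Sequence[T],
    robot: Sequence[T],
    batch_size: int,
    human_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Draw one training batch containing samples from both datasets.

    human_fraction is an assumed tuning knob, not a value from the paper.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must lie in [0, 1]")
    if not human or not robot:
        raise ValueError("both datasets must be non-empty")
    n_human = round(batch_size * human_fraction)
    # Sample with replacement so a small dataset can still fill its share.
    batch = [rng.choice(human) for _ in range(n_human)]
    batch += [rng.choice(robot) for _ in range(batch_size - n_human)]
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` instance keeps batches reproducible. Setting `human_fraction` to 0 or 1 recovers training on a single source, which is the baseline the abstract compares against.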

Problem

Research questions and friction points this paper is trying to address.

robot navigation

egocentric video

imitation learning

vision-language-action models

data augmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric video

robot navigation

imitation learning

vision-language-action model

camera motion estimation

🔎 Similar Papers

Learning Adaptive Multi-Objective Robot Navigation Incorporating Demonstrations

2024-04-07 · Citations: 0