Vision-based Manipulation from Single Human Video with Open-World Object Graphs

📅 2024-05-30
🏛️ arXiv.org
📈 Citations: 40
✨ Influential: 1
🤖 AI Summary
This work addresses the challenge of enabling robots to generalize vision-based manipulation skills to unseen objects in open-world environments from a single human RGB-D demonstration video. To this end, the authors propose ORION, a one-shot imitation learning framework grounded in an open-world object graph that requires no predefined object categories or environmental priors. Its core components include object-centric modeling, RGB-D video parsing, manipulation graph extraction, plan-conditioned policy learning, and multimodal representation alignment. ORION generalizes across varying backgrounds, viewpoints, scene layouts, and previously unseen object instances, enabling robust manipulation planning and policy transfer. Experiments demonstrate that ORION significantly outperforms existing baselines on both short- and long-horizon tasks. It supports real-world deployment with videos captured on consumer-grade devices (e.g., an iPad) and transfers policies to diverse physical environments, accomplishing zero-shot manipulation of novel objects.
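To make the "open-world object graph" idea concrete, the sketch below shows one plausible shape for such a data structure: nodes hold open-vocabulary object labels with tracked 3D keypoints, edges hold pairwise relations, and a manipulation plan is a sequence of graphs at video keyframes. All names here (`ObjectNode`, `ObjectGraph`, `ManipulationPlan`) are illustrative assumptions, not ORION's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str        # open-vocabulary label, e.g. "mug" (no predefined categories)
    keypoints: list  # tracked 3D keypoints of this object over the demo video

@dataclass
class GraphEdge:
    src: str
    dst: str
    relation: str    # e.g. "on-top-of", "in-contact"

@dataclass
class ObjectGraph:
    """One keyframe's object-centric scene state."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def add_relation(self, src: str, dst: str, relation: str) -> None:
        self.edges.append(GraphEdge(src, dst, relation))

@dataclass
class ManipulationPlan:
    """A manipulation plan as a sequence of object graphs at keyframes."""
    keyframes: list = field(default_factory=list)

    def append_keyframe(self, graph: ObjectGraph) -> None:
        self.keyframes.append(graph)
```

A policy could then be conditioned on this plan by reading off the target inter-object relations at each keyframe, which is consistent with the summary's description of plan extraction followed by plan-conditioned policy learning.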

πŸ“ Abstract
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world. Videos can be found on the project website https://ut-austin-rpl.github.io/ORION-release.
Problem

Research questions and friction points this paper is trying to address.

Learning robot manipulation from single human videos
Generalizing policies to novel objects and environments
Extracting object-centric plans from RGB-D videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric manipulation plan extraction
Single RGB-D video learning
Generalization to novel object instances
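One way to realize the generalization to novel object instances listed above is to match each object in the demonstration plan to a detected object in the deployment scene by visual-feature similarity, so the plan transfers even when the exact instances differ. The sketch below is a hedged illustration of that matching step only; the feature extractor (e.g., an open-vocabulary detector) and the function name `match_objects` are assumptions, not ORION's stated implementation.

```python
import numpy as np

def match_objects(plan_feats: np.ndarray, scene_feats: np.ndarray) -> list:
    """Match each plan object to its most similar object in the new scene.

    plan_feats:  (P, D) unit-normalized features of objects in the demo plan.
    scene_feats: (S, D) unit-normalized features of objects detected at test time.
    Returns, for each plan object, the index of the best-matching scene object.
    """
    sims = plan_feats @ scene_feats.T   # (P, S) cosine similarities
    return sims.argmax(axis=1).tolist()
```

For example, with two plan objects and two scene detections whose features are swapped, the matcher pairs each plan object with the correct novel instance: `match_objects([[1,0],[0,1]], [[0,1],[1,0]])` yields `[1, 0]`.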
Yifeng Zhu
The University of Texas at Austin
Arisrei Lim
The University of Texas at Austin
Peter Stone
The University of Texas at Austin, Sony AI
Yuke Zhu
The University of Texas at Austin, NVIDIA Research
Robot Learning · Computer Vision · Machine Learning · Robotics · Artificial Intelligence