🤖 AI Summary
This work addresses the reliance of general-purpose robotic manipulation on large-scale, high-quality action-labeled datasets. We propose a self-supervised learning framework whose core training requires no action annotations. Methodologically, we leverage unlabeled human or robot manipulation videos to model dynamic 3D point clouds of hand or gripper regions, combining a pretrained vision-language model, dense 3D point cloud extraction, and a 3D dynamics predictor that is subsequently calibrated to action semantics with a small amount of labeled data. Our key contribution is the first demonstration of learning generalizable manipulation representations directly from raw video, enabling zero-shot transfer across tasks and domains (e.g., simulation to real). Experiments show substantial improvements in data efficiency and cross-task generalization, with consistent gains in both simulated and real-world robotic settings.
📝 Abstract
Recent advances in generalist robot manipulation leverage pre-trained Vision-Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, on which existing methods rely for robustness and generalization. To address this, we propose a method that benefits from videos without action labels (featuring humans and/or robots in action), enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned into an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations, improving downstream generalist robot policies, but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
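
The two-stage recipe described above (self-supervised 3D dynamics prediction on unlabeled video, followed by action alignment on a small labeled set) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; all module names, tensor shapes, and the simple L2 losses are assumptions chosen for clarity.

```python
# Hypothetical sketch of the two training stages described in the abstract.
# Stage 1: self-supervised 3D dynamics prediction from unlabeled manipulation videos.
# Stage 2: aligning the pretrained predictor to robot actions with a small labeled set.
import torch
import torch.nn as nn

class DynamicsPredictor(nn.Module):
    """Predicts the future 3D point cloud of the hand/gripper region from the
    current point cloud and a language embedding (e.g., from a pretrained VLM)."""
    def __init__(self, num_points=512, lang_dim=768, hidden=256):
        super().__init__()
        self.point_encoder = nn.Sequential(              # per-point MLP encoder
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.fuse = nn.Sequential(                       # fuse pooled points + language
            nn.Linear(hidden + lang_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, num_points * 3) # per-point future offsets
        self.num_points = num_points

    def forward(self, points, lang_emb):
        # points: (B, N, 3) hand/gripper point cloud at time t; N == num_points assumed
        feat = self.point_encoder(points).mean(dim=1)        # (B, hidden) pooled feature
        fused = self.fuse(torch.cat([feat, lang_emb], -1))   # (B, hidden)
        delta = self.decoder(fused).view(-1, self.num_points, 3)
        return points + delta                                 # predicted cloud at t + k

# ---- Stage 1: self-supervision on unlabeled video (no action labels) ----
def dynamics_loss(predictor, clip):
    """clip: dict with point clouds at t and t+k plus a language embedding."""
    pred = predictor(clip["points_t"], clip["lang_emb"])
    # Simple L2 loss against the observed future point cloud (a Chamfer-style
    # loss would also be a reasonable choice here).
    return ((pred - clip["points_t_plus_k"]) ** 2).mean()

# ---- Stage 2: action alignment with a small labeled dataset ----
class ActionHead(nn.Module):
    """Maps predicted 3D dynamics to low-level robot actions (e.g., 7-DoF deltas)."""
    def __init__(self, num_points=512, action_dim=7, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_points * 3, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def forward(self, predicted_points):
        return self.net(predicted_points.flatten(1))

def action_loss(predictor, head, batch):
    # Freeze the dynamics predictor here; end-to-end fine-tuning is also possible.
    with torch.no_grad():
        pred_cloud = predictor(batch["points_t"], batch["lang_emb"])
    return ((head(pred_cloud) - batch["action"]) ** 2).mean()
```

In this sketch, Stage 1 consumes only raw video-derived point clouds and language embeddings, while Stage 2 touches action labels only through the lightweight `ActionHead`, which is why a much smaller labeled dataset can suffice.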