Generalist Robot Manipulation beyond Action Labeled Data

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reliance of general-purpose robotic manipulation on large-scale, high-quality action-labeled datasets. We propose a self-supervised learning framework that requires no action annotations: unlabeled human or robot manipulation videos are used to model dynamic 3D point clouds of the hand or gripper region, combining a pretrained vision-language model, dense 3D point cloud extraction, and a 3D dynamics predictor that is calibrated to action semantics with a small amount of labeled data. The key contribution is the first demonstration of learning generalizable manipulation representations directly from raw video, enabling zero-shot transfer across tasks and domains (e.g., simulation to real). Experiments show substantial improvements in data efficiency and cross-task generalization, with consistent performance gains in both simulated and real-world robotic settings.
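
To make the pipeline concrete, below is a minimal sketch of the self-supervised stage: a dynamics predictor regresses the next hand/gripper point cloud from a short history of clouds tracked in unlabeled video, trained with a Chamfer-distance loss. All names (PointDynamicsPredictor, chamfer_distance), the MLP architecture, and the tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the self-supervised 3D dynamics objective. Assumption:
# the predictor regresses the next-step hand/gripper point cloud from a
# short history of clouds; names and architecture are illustrative only.
import torch
import torch.nn as nn

class PointDynamicsPredictor(nn.Module):
    """Predicts the next 3D point cloud (N x 3) from T past clouds."""
    def __init__(self, num_points: int = 512, history: int = 4, hidden: int = 256):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(history * num_points * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_points * 3),
        )

    def forward(self, past_clouds: torch.Tensor) -> torch.Tensor:
        # past_clouds: (B, T, N, 3) -> predicted next cloud (B, N, 3)
        b = past_clouds.shape[0]
        pred = self.mlp(past_clouds.reshape(b, -1))
        return pred.reshape(b, self.num_points, 3)

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Symmetric Chamfer distance between point sets a, b of shape (B, N, 3).
    d = torch.cdist(a, b)  # (B, N, N) pairwise Euclidean distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

# Self-supervised step: no action labels, only point clouds from video.
model = PointDynamicsPredictor()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
past = torch.randn(8, 4, 512, 3)    # placeholder for clouds tracked in video
future = torch.randn(8, 512, 3)     # placeholder for the observed next cloud
optim.zero_grad()
loss = chamfer_distance(model(past), future)
loss.backward()
optim.step()
```

Because the targets come straight from the video itself, this stage needs no action labels; the supervision signal is the future hand/gripper geometry.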

📝 Abstract
Recent advances in generalist robot manipulation leverage pre-trained Vision-Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels (featuring humans and/or robots in action), enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations, improving downstream generalist robot policies, but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
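
The abstract's second stage, tuning the self-supervised predictor into an action predictor on a small labeled set, might look roughly like the sketch below: a lightweight head maps the predictor's output to robot actions, and only this head (plus optionally the predictor) sees action labels. The head design, dimensions, and names (ActionHead) are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of the action-alignment stage (illustrative, not the
# paper's code): a small head maps predicted point-cloud dynamics to
# robot actions and is tuned on a small action-labeled dataset.
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Maps a predicted point cloud to a low-dim action (e.g., 7-DoF pose)."""
    def __init__(self, num_points: int = 512, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(num_points * 3, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, pred_cloud: torch.Tensor) -> torch.Tensor:
        # pred_cloud: (B, N, 3) -> action: (B, action_dim)
        return self.head(pred_cloud.flatten(1))

# Assume the self-supervised dynamics predictor from the previous stage is
# frozen (or lightly tuned); only the head trains on labeled demonstrations.
head = ActionHead()
optim = torch.optim.Adam(head.parameters(), lr=3e-4)
pred_cloud = torch.randn(8, 512, 3)   # stand-in for the predictor's output
actions = torch.randn(8, 7)           # small batch of action labels
optim.zero_grad()
loss = nn.functional.mse_loss(head(pred_cloud), actions)
loss.backward()
optim.step()
```

Keeping the labeled stage this small is what makes the approach data-efficient: the heavy lifting is done by the self-supervised dynamics model.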
Problem

Research questions and friction points this paper is trying to address.

Scaling high-quality action-labeled robot demonstration data for robustness
Learning robot manipulation from unlabeled human and robot demonstration videos
Enabling robots to learn new tasks without action labels for generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts dense 3D point clouds at hand or gripper locations (sketched after this list)
Uses a 3D dynamics predictor for self-supervised learning
Tunes the predictor on a small labeled dataset for action alignment
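
As a rough illustration of the first point, the sketch below back-projects depth pixels inside a hand/gripper mask into a 3D point cloud using pinhole camera intrinsics. How the paper actually obtains the mask and the clouds is not specified on this page, so the mask source and all names here are hypothetical.

```python
# Hedged sketch of dense 3D point extraction at the hand/gripper region:
# back-project depth pixels inside a segmentation mask with pinhole
# intrinsics. The mask source (e.g., a hand-segmentation model) is an
# assumption; this page does not specify the paper's extraction pipeline.
import numpy as np

def backproject_hand_points(depth: np.ndarray, mask: np.ndarray,
                            fx: float, fy: float,
                            cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) in meters; mask: (H, W) bool over the hand region.
    Returns an (N, 3) point cloud in camera coordinates."""
    v, u = np.nonzero(mask)        # pixel rows/cols inside the mask
    z = depth[v, u]
    valid = z > 0                  # drop missing depth readings
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx          # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Example with synthetic data:
depth = np.random.uniform(0.3, 1.5, size=(480, 640))
mask = np.zeros((480, 640), dtype=bool)
mask[200:260, 300:360] = True      # stand-in for a detected hand region
cloud = backproject_hand_points(depth, mask,
                                fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)                 # (N, 3)
```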