🤖 AI Summary
This work addresses the embodiment gap that arises when dexterous manipulation policies are learned directly from human videos, proposing a transfer approach that requires no robot demonstrations. The method extracts 3D keypoints of the hand and task-relevant objects from human videos and trains an autoregressive Transformer policy that uses the wrist and fingertips as a unified observation and action representation. The policy is deployed directly on a real multi-fingered dexterous hand without any robot-specific demonstrations. Trained only on human video data, it attains a 75.0% success rate on real-robot pick-and-place and tool-use tasks, compared with 1.0% for a state-of-the-art vision-language-action (VLA) baseline, and it generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.
📝 Abstract
Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming. The problem is particularly acute in dexterous manipulation, where teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.
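To make the idea of a unified keypoint representation concrete, the following is a minimal sketch of how per-frame hand and object keypoints might be turned into (observation, action) pairs for next-step prediction. It is an illustration under stated assumptions, not the paper's implementation: the function name `make_training_pairs`, the `history` parameter, and the choice of six hand keypoints (wrist plus five fingertips) in the usage below are all hypothetical, and keypoint extraction from video is assumed to have already happened.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]   # one 3D keypoint (x, y, z)
Frame = List[Point]                  # all keypoints of one kind at one time step


def make_training_pairs(
    hand_traj: List[Frame],
    object_traj: List[Frame],
    history: int = 2,
) -> List[Tuple[List[Point], Frame]]:
    """Build (observation, action) pairs from one keypoint trajectory.

    Assumption (hypothetical, not from the paper):
      - observation = hand + object keypoints over the last `history` frames
      - action      = hand keypoints (e.g. wrist and fingertips) at the next frame
    Because the action lives in the same keypoint space as the observation,
    the same pairs could in principle describe a human hand or a robot hand.
    """
    if history < 1:
        raise ValueError("history must be at least 1")
    if len(hand_traj) != len(object_traj):
        raise ValueError("hand and object trajectories must have equal length")

    pairs = []
    # t is the index of the most recent observed frame; t + 1 is the target.
    for t in range(history - 1, len(hand_traj) - 1):
        observation: List[Point] = []
        for k in range(t - history + 1, t + 1):
            observation.extend(hand_traj[k] + object_traj[k])
        action = hand_traj[t + 1]
        pairs.append((observation, action))
    return pairs


if __name__ == "__main__":
    # Toy trajectory: 5 frames, 6 hand keypoints, 3 object keypoints.
    hand = [[(float(t), float(i), 0.0) for i in range(6)] for t in range(5)]
    obj = [[(float(t), float(j), 1.0) for j in range(3)] for t in range(5)]
    for obs, act in make_training_pairs(hand, obj, history=2):
        print(len(obs), "observed keypoints ->", len(act), "target keypoints")
```

In a sketch like this, an autoregressive model would be trained to predict each action from its observation, and at deployment the predicted wrist and fingertip positions would have to be mapped to robot joint commands by a separate step that the abstract does not describe.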