AI Summary
This work addresses the challenge of automatically transferring dexterous robotic skills from a single human demonstration video despite perception errors and the kinematic discrepancies between humans and robots. The authors propose a full-stack framework, Video2Sim2Real, that uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin of the human manipulation sequence and extract robot and object motion priors. Object-centric keyframes are used to optimize the corresponding robot configurations, which then serve as anchors for refining the robot motion. For sim-to-real transfer, the approach decouples robustness to noisy perception from variations in hand-object interaction dynamics: imitation learning recalibrates robot configurations from noisy point clouds, and residual reinforcement learning makes local finger-level adaptations. A collision-aware motion planning module provides spatial generalization to novel object configurations. Evaluated on several everyday manipulation tasks, the method improves task success, safety, and trajectory coherence in simulation over the baselines and achieves better sim-to-real transfer than existing techniques.
Abstract
Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and the embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: we identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via imitation learning (IL), and leverage residual reinforcement learning (RL) to perform local finger-level adaptations that ensure robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.
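To make the idea of finger-level residual adaptation concrete, the sketch below shows one common way a residual policy output can be combined with a base action: a bounded correction is added only to the finger joints, and all other joints follow the base action unchanged. The abstract does not specify the actual formulation, so this is an illustrative assumption. The function name `compose_action` and the parameters `finger_idx`, `scale`, and `limit` are hypothetical.

```python
from typing import List, Sequence


def compose_action(
    base: Sequence[float],
    residual: Sequence[float],
    finger_idx: Sequence[int],
    scale: float = 0.1,
    limit: float = 1.0,
) -> List[float]:
    """Add a bounded residual to selected joints of a base action.

    Hypothetical illustration only: `base` stands in for the joint targets
    from an imitation-learned policy, `residual` for the raw output of a
    residual RL policy, and `finger_idx` for the indices of finger joints.
    """
    if len(base) != len(residual):
        raise ValueError("base and residual must have the same length")

    action = list(base)
    for i in finger_idx:
        # Clip the raw residual to [-limit, limit], then scale it so the
        # correction stays a small, local adjustment around the base action.
        clipped = max(-limit, min(limit, residual[i]))
        action[i] = base[i] + scale * clipped
    return action


if __name__ == "__main__":
    # Joint 0 is treated as an arm joint (untouched); joints 1 and 2 as fingers.
    print(compose_action([0.0, 1.0, 2.0], [5.0, 0.5, -3.0], finger_idx=[1, 2]))
```

Restricting and scaling the residual in this way keeps the learned correction from overriding the base motion, which is one plausible reading of "local" adaptation. The scaling factor and clipping bound here are placeholder values.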