🤖 AI Summary
This work addresses the challenges of learning robotic manipulation from human video demonstrations: noisy hand–object interactions, unseen objects that are only partially observed, and morphological discrepancies between humans and robots. The authors propose HOWTransfer, a hand-centric framework that recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand–object interaction cues. At the localized contact onsets, it retargets human grasp intent into multi-modal parallel-jaw grasp hypotheses, propagates them along the recovered wrist trajectory to produce robot-executable motions, and then applies a trajectory editing stage that refines contact alignment and generates diverse variants from a single demonstration. Notably, the method operates without object-specific descriptions, vision-language queries, or explicit object-state tracking. Evaluated across diverse manipulation tasks, it attains an 86% success rate, and its trajectories were preferred over teleoperated ones in a blinded preference study.
📝 Abstract
Learning from human video demonstrations remains challenging due to noisy hand-object interactions, unseen objects that are only partially observed, and cross-embodiment discrepancies. To address these challenges, we present \textit{HOWTransfer} (\emph{H}and-\emph{O}bject \emph{O}pen-\emph{W}orld Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robotic trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, \emph{HOWTransfer} recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that \emph{HOWTransfer} enables accurate contact localization and high-quality robot motion retargeting with $86\%$ success, and that its trajectories are preferred over teleoperated trajectories in a blinded preference study.
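
To make two of the steps above concrete, here is a minimal sketch of contact-interval localization and grasp propagation along a wrist trajectory. It is not the authors' implementation: it assumes a per-frame contact probability is already available, localizes intervals by simple thresholding, and carries a gripper pose along the wrist path with a fixed rigid offset. The function names (`contact_intervals`, `propagate_grasp`) and the parameters `threshold` and `min_len` are hypothetical.

```python
import numpy as np

def contact_intervals(contact_prob, threshold=0.5, min_len=3):
    """Return [start, end) frame intervals where contact_prob >= threshold.

    Assumption: contact_prob is a per-frame score in [0, 1] from some
    upstream hand-object interaction estimator (not specified here).
    """
    active = np.asarray(contact_prob) >= threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    # Drop runs shorter than min_len frames as likely noise.
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

def propagate_grasp(wrist_poses, grasp_pose_at_onset, onset):
    """Carry a gripper pose rigidly along the wrist trajectory.

    wrist_poses: (T, 4, 4) homogeneous wrist transforms in a common frame.
    grasp_pose_at_onset: (4, 4) gripper pose hypothesis at the onset frame.
    Assumption: the wrist-to-gripper offset stays fixed after the onset.
    """
    wrist_poses = np.asarray(wrist_poses, dtype=float)
    # Wrist-to-gripper offset, measured at the contact onset frame.
    offset = np.linalg.inv(wrist_poses[onset]) @ grasp_pose_at_onset
    # Broadcasts the (4, 4) offset over the remaining frames.
    return wrist_poses[onset:] @ offset

# Usage: a wrist sliding along x, with a grasp 0.1 above the wrist at onset.
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.6, 0.1, 0.9, 0.1]
intervals = contact_intervals(probs)
onset = intervals[0][0]
wrist = np.tile(np.eye(4), (len(probs), 1, 1))
wrist[:, 0, 3] = np.arange(len(probs))
grasp = wrist[onset].copy()
grasp[2, 3] += 0.1
gripper_traj = propagate_grasp(wrist, grasp, onset)
```

Each grasp hypothesis would be propagated separately in this scheme, giving one candidate robot trajectory per hypothesis; the contact-alignment refinement of the editing stage is not modeled here.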