🤖 AI Summary
This work addresses the difficulty of acquiring high-quality data for training whole-body visuomotor policies on humanoid robots, a task that currently relies mostly on robot teleoperation. It proposes BifrostUMI, a framework that requires no physical robot during data collection. Lightweight VR equipment captures human demonstrations as sparse keypoint trajectories while wrist-mounted visual data is recorded at the same time. A high-level policy network is trained to predict future keypoint trajectories from the visual features; the predicted trajectories are then mapped onto the robot’s morphology by a keypoint retargeting pipeline and executed by a whole-body controller. Extending the UMI (Universal Manipulation Interface) paradigm to humanoid robotics for the first time, the method generates diverse and agile robot behaviors from human demonstrations alone. The framework is validated in two experimental scenarios, which show that natural human motions transfer to a humanoid robot.
📝 Abstract
High-quality data collection is fundamental to training humanoid whole-body visuomotor policies. Current data acquisition paradigms rely predominantly on robot teleoperation, which is often hindered by limited hardware accessibility and low operational efficiency. Inspired by the Universal Manipulation Interface (UMI), we propose BifrostUMI, a portable, efficient, and robot-free data collection framework tailored for humanoid robots. BifrostUMI uses lightweight VR devices to capture human demonstrations as sparse keypoint trajectories while simultaneously recording wrist-mounted visual data. These multimodal data are then used to train a high-level policy network that predicts future keypoint trajectories conditioned on the captured visual features. A keypoint retargeting pipeline maps the predicted trajectories onto the robot's morphology, and a whole-body controller executes them. This approach transfers diverse and agile behaviors from natural human demonstrations to humanoid embodiments. We demonstrate the efficacy and versatility of the proposed framework in two distinct experimental scenarios.
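To make the retargeting step more concrete, the sketch below shows one simple way sparse human keypoints could be mapped onto a robot with different proportions. It is an illustration only, not the BifrostUMI pipeline: the keypoint set (pelvis, head, two wrists), the `KeypointFrame` type, and the `height_scale` and `reach_scale` parameters are all assumptions introduced here, and the abstract does not specify how the actual retargeting is done.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class KeypointFrame:
    """One timestep of sparse keypoints in metres (hypothetical keypoint set)."""
    pelvis: Vec3
    head: Vec3
    left_wrist: Vec3
    right_wrist: Vec3


def _scale_about(point: Vec3, human_root: Vec3, robot_root: Vec3, scale: float) -> Vec3:
    """Scale the offset of `point` from the human root and re-attach it to the robot root."""
    return tuple(r + scale * (p - h) for p, h, r in zip(point, human_root, robot_root))


def retarget_frame(frame: KeypointFrame, height_scale: float, reach_scale: float) -> KeypointFrame:
    """Map one human frame onto robot proportions.

    Assumption: the pelvis keeps its horizontal position and its height is
    scaled by `height_scale`; the other keypoints keep their direction from
    the pelvis, with the distance scaled by `reach_scale`.
    """
    px, py, pz = frame.pelvis
    robot_pelvis = (px, py, pz * height_scale)
    return KeypointFrame(
        pelvis=robot_pelvis,
        head=_scale_about(frame.head, frame.pelvis, robot_pelvis, reach_scale),
        left_wrist=_scale_about(frame.left_wrist, frame.pelvis, robot_pelvis, reach_scale),
        right_wrist=_scale_about(frame.right_wrist, frame.pelvis, robot_pelvis, reach_scale),
    )


def retarget_trajectory(
    frames: List[KeypointFrame], height_scale: float, reach_scale: float
) -> List[KeypointFrame]:
    """Retarget every frame of a (predicted or recorded) keypoint trajectory."""
    return [retarget_frame(f, height_scale, reach_scale) for f in frames]


if __name__ == "__main__":
    human = KeypointFrame(
        pelvis=(0.0, 0.0, 1.0),
        head=(0.0, 0.0, 1.6),
        left_wrist=(0.5, 0.2, 1.0),
        right_wrist=(0.5, -0.2, 1.0),
    )
    for robot_frame in retarget_trajectory([human], height_scale=0.8, reach_scale=0.5):
        print(robot_frame)
```

In a full system, the retargeted keypoints would serve as tracking targets for the whole-body controller. A real pipeline would also need to handle orientations, joint limits, and balance, none of which this sketch attempts.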