🤖 AI Summary
Learning dexterous robotic manipulation from a single egocentric RGB-D video of a human demonstration is difficult because object pose, geometry, and contact information are often missing, and existing approaches typically depend on pre-scanned object assets. This work proposes EgoAERO, which the authors present as the first framework to learn dexterous manipulation from a single egocentric RGB-D demonstration without object assets. It recovers contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies with two-stage residual learning, supplemented by an online quality assessment mechanism. In simulation and real-world experiments, EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to that of CAD-based reconstructions on HOI4D. The work also introduces EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames.
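The summary does not say how the residual is combined with the reference motion, so the following is only a minimal sketch of the generic residual-policy idea: a learned correction is added to a base action (here imagined as coming from the retargeted human trajectory). The function name `residual_action`, the `scale` factor, and the normalized action bounds are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def residual_action(base_action, residual, scale=0.1, low=-1.0, high=1.0):
    """Combine a base action with a bounded learned correction.

    base_action: action proposed by the reference trajectory (assumption).
    residual:    raw output of a learned residual policy (assumption).
    scale:       hypothetical cap on how far the residual may shift the action.
    """
    base_action = np.asarray(base_action, dtype=float)
    # Bound the raw residual so the correction stays small relative to the base.
    residual = np.clip(np.asarray(residual, dtype=float), -1.0, 1.0)
    # Final command is the base action plus the scaled correction, kept in range.
    return np.clip(base_action + scale * residual, low, high)
```

Keeping the correction small relative to the base action is a common design choice in residual learning, since the policy then only has to fix tracking errors instead of learning the whole motion from scratch.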
📝 Abstract
Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.
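To illustrate what ego-motion compensation means in general terms, the sketch below maps points observed in a moving head-mounted camera frame into one fixed world frame, so that camera motion is not mistaken for hand or object motion. It assumes per-frame camera-to-world poses are already available (for example from visual odometry); the abstract does not state how EgoAERO obtains them, and the name `compensate_ego_motion` and the array layouts are hypothetical.

```python
import numpy as np

def compensate_ego_motion(points_cam, T_world_cam):
    """Express per-frame 3D points in a fixed world frame.

    points_cam:  (T, N, 3) points in each frame's camera coordinates.
    T_world_cam: (T, 4, 4) homogeneous camera-to-world poses
                 (assumed given; not specified by the source).
    """
    points_cam = np.asarray(points_cam, dtype=float)
    T_world_cam = np.asarray(T_world_cam, dtype=float)
    R = T_world_cam[:, :3, :3]  # per-frame rotations, shape (T, 3, 3)
    t = T_world_cam[:, :3, 3]   # per-frame translations, shape (T, 3)
    # Apply p_world = R @ p_cam + t to every point of every frame.
    return np.einsum("tij,tnj->tni", R, points_cam) + t[:, None, :]
```

With this transform, a point that is static in the world maps to the same world coordinates in every frame, regardless of how the camera moved between frames.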