🤖 AI Summary
Existing methods for zero-shot bimanual manipulation with dual-arm robots neglect end-effector states and struggle to model inter-hand coordination. To address this, the authors propose Ag2x2, an agent-agnostic visual representation framework that explicitly models bimanual synergy. The approach jointly encodes object dynamics and bimanual motion patterns via contrastive learning and cross-modal encoding, requiring neither human demonstrations nor hand-crafted reward functions. Crucially, it decouples agent-specific information (e.g., pose) from task-invariant features, enabling unified, coordination-aware visual representations. Evaluated on 13 bimanual manipulation tasks, the method achieves a 73.5% zero-shot success rate, substantially outperforming reward-engineered baselines, and remains robust in complex scenarios involving deformable objects (e.g., ropes).
📝 Abstract
Bimanual manipulation, fundamental to human daily activities, remains challenging due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook agent-specific information crucial for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while remaining agent-agnostic. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects such as ropes. It outperforms baseline methods and even surpasses the success rate of policies trained with expert-engineered rewards. Furthermore, we show that representations learned by Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.
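To give a concrete feel for the contrastive component mentioned above, the sketch below shows a minimal InfoNCE-style loss that pulls paired embeddings (e.g., an object-state feature and the matching hand-motion feature from the same clip) together while pushing mismatched pairs apart. This is an illustrative sketch only: the function name, embedding shapes, and temperature are assumptions for the example, not details of Ag2x2's actual architecture or training objective.

```python
# Illustrative InfoNCE-style contrastive loss (NOT the Ag2x2 implementation).
# obj_emb[i] and motion_emb[i] are assumed to come from the same video clip,
# so the diagonal of the similarity matrix holds the positive pairs.
import numpy as np

def info_nce_loss(obj_emb: np.ndarray, motion_emb: np.ndarray,
                  temperature: float = 0.1) -> float:
    """obj_emb, motion_emb: (N, D) L2-normalized embeddings of N paired clips."""
    # Cosine-similarity logits between every object/motion pair: (N, N).
    logits = obj_emb @ motion_emb.T / temperature
    # Subtract the row max for numerical stability before the softmax.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching (diagonal) pair as the target class.
    return float(-log_probs.diagonal().mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
a /= np.linalg.norm(a, axis=1, keepdims=True)
# Perfectly aligned pairs should score a lower loss than shuffled pairs.
aligned = info_nce_loss(a, a)
shuffled = info_nce_loss(a, a[rng.permutation(8)])
```

The design intuition is that a representation trained this way cannot ignore hand motion: to score the positive pair highly it must encode how the hands move relative to the object, which is the coordination signal the summary says prior agent-agnostic representations discard.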