Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation

📅 2025-07-26
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
Existing methods for zero-shot bimanual manipulation with dual-arm robots neglect end-effector states and struggle to model inter-hand coordination. To address this, we propose the first agent-agnostic visual representation framework that explicitly models bimanual synergy. Our approach jointly encodes object dynamics and bimanual motion patterns via contrastive learning and cross-modal encoding—requiring neither human demonstrations nor hand-crafted reward functions. Crucially, it decouples agent-specific information (e.g., pose) from task-invariant features, enabling unified, coordination-aware visual representation. Evaluated on 13 bimanual manipulation tasks, our method achieves a 73.5% zero-shot success rate—substantially outperforming reward-engineered baselines—and maintains robustness in complex scenarios involving deformable objects (e.g., ropes).
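The summary names two mechanisms, contrastive learning and cross-modal encoding, without architectural detail. Below is a minimal PyTorch sketch of one plausible reading: an object-dynamics encoder and a bimanual hand-motion encoder trained so that embeddings of the same clip agree. The ClipEncoder module, GRU backbone, feature dimensions, and InfoNCE pairing are illustrative assumptions, not the authors' published design.

```python
# Hypothetical sketch of "contrastive learning + cross-modal encoding":
# align an object-dynamics embedding with a bimanual hand-motion embedding
# from the same clip (positives) against other clips in the batch
# (negatives). All architecture choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Encodes a per-timestep feature sequence into one unit-norm embedding."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(x)                      # h: (1, B, embed_dim)
        return F.normalize(self.proj(h[-1]), dim=-1)

def info_nce(z_obj: torch.Tensor, z_hand: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE: matched (object, hand-motion) pairs attract."""
    logits = z_obj @ z_hand.t() / tau           # (B, B) similarity matrix
    labels = torch.arange(z_obj.size(0))        # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy batch: 16 clips, 10 timesteps each.
obj_feats = torch.randn(16, 10, 64)   # e.g., per-frame object features
hand_feats = torch.randn(16, 10, 42)  # e.g., two hands x 21 keypoints
loss = info_nce(ClipEncoder(64)(obj_feats), ClipEncoder(42)(hand_feats))
loss.backward()
```

Note that in this sketch the hand-motion branch encodes both hands' trajectories jointly rather than each arm in isolation, which is one way to make the representation coordination-aware in the sense the summary describes.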

📝 Abstract
Bimanual manipulation, fundamental to human daily activities, remains challenging due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook agent-specific information crucial for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while remaining agent-agnostic. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects such as ropes, outperforming baseline methods and even surpassing policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.
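The abstract claims policy learning without engineered rewards. One standard recipe for turning a frozen visual representation into a training signal is to reward progress toward a goal image in embedding space; the sketch below illustrates that recipe with hypothetical names (phi, embedding_reward). The paper's exact reward formulation is not given on this page, so treat this as an assumption-laden illustration, not the authors' method.

```python
# Hypothetical use of a frozen Ag2x2-style encoder `phi` as a dense reward:
# r_t is the drop in cosine distance between the current observation's
# embedding and a goal image's embedding. One common recipe for training
# without hand-crafted rewards; the paper's formulation may differ.
import torch
import torch.nn.functional as F

def embedding_reward(phi, obs_prev, obs_curr, obs_goal):
    """r_t = d(phi(o_{t-1}), phi(g)) - d(phi(o_t), phi(g)), d = cosine distance."""
    with torch.no_grad():
        z_prev, z_curr, z_goal = phi(obs_prev), phi(obs_curr), phi(obs_goal)
    d_prev = 1.0 - F.cosine_similarity(z_prev, z_goal, dim=-1)
    d_curr = 1.0 - F.cosine_similarity(z_curr, z_goal, dim=-1)
    return d_prev - d_curr  # positive when the embedding moves toward the goal

# Toy check with a stand-in encoder over 64x64 RGB frames.
phi = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
frames = [torch.rand(1, 3, 64, 64) for _ in range(3)]
print(embedding_reward(phi, frames[0], frames[1], frames[2]))
```

A reward of this form only requires the encoder to order states by task progress, which is why a representation trained purely from videos can drive reinforcement learning with no demonstrations or reward engineering in the loop.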
Problem

Research questions and friction points this paper is trying to address.

Overcoming bimanual coordination complexity in robotic manipulation
Enhancing zero-shot learning with agent-agnostic visual representations
Improving success rates in diverse bimanual tasks without expert supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coordination-aware visual representations for bimanual tasks
Encodes object states and hand motion patterns
Agent-agnostic yet robust across diverse tasks
Ziyin Xiong
National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI); School of Psychological and Cognitive Sciences, Peking University; Institute for Artificial Intelligence, Peking University; Beijing Key Laboratory of Behavior and Mental Health, Peking University; Yuanpei College, Peking University
Yinghan Chen
National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI); School of Psychological and Cognitive Sciences, Peking University; Institute for Artificial Intelligence, Peking University; Beijing Key Laboratory of Behavior and Mental Health, Peking University; Department of Computer Science and Technology, University of Cambridge
Puhao Li
Ph.D. Student, Tsinghua University
Computer Vision, Robotics, Machine Learning
Yixin Zhu
Assistant Professor, Peking University
Computer Vision, Visual Reasoning, Human-Robot Teaming
Tengyu Liu
Beijing Institute for General Artificial Intelligence
computer vision, human object interaction, human motion generation, grasping
Siyuan Huang
National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI)