🤖 AI Summary
Existing benchmarks do not adequately capture the control complexity and failure modes of bimanual (dual-arm) manipulation. To address this gap, this work proposes DuoBench, an extensible, reproducible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. It comprises 11 tasks across four coordination categories, implemented in simulation and partially reproduced in the real world through task recipes with 3D-printable assets. DuoBench adds a stage-based evaluation scheme for fine-grained semantic failure analysis beyond binary success, human-teleoperated datasets for all tasks, and benchmark results for several dual-arm imitation-learning and vision-language-action policies. Experiments show that current policies struggle most in early interaction stages, parallel arm execution, and transfer between simulation and the real world, and that DuoBench offers a reproducible testbed for diagnosing these failure modes.
📝 Abstract
Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at https://duobench.github.io/
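To illustrate what a stage-based evaluation can report beyond binary success, here is a minimal sketch. It is not DuoBench's implementation: the stage names, the per-rollout boolean format, and the functions `first_failed_stage` and `summarize` are all hypothetical, chosen only to show how recording the first uncompleted stage yields a failure histogram alongside the overall success rate.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical stage names for a single bimanual task (assumption, not from DuoBench).
STAGES = ("reach", "grasp", "lift", "place")


def first_failed_stage(reached: Sequence[bool]) -> Optional[str]:
    """Return the name of the first stage that was not completed, or None on full success."""
    for name, ok in zip(STAGES, reached):
        if not ok:
            return name
    return None


def summarize(rollouts: Sequence[Sequence[bool]]) -> dict:
    """Aggregate the binary success rate and a histogram of first-failed stages."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    failures: Counter = Counter()
    successes = 0
    for reached in rollouts:
        stage = first_failed_stage(reached)
        if stage is None:
            successes += 1
        else:
            failures[stage] += 1
    return {"success_rate": successes / len(rollouts), "failed_at": dict(failures)}


if __name__ == "__main__":
    # Four made-up rollouts: one full success, two failing at grasp, one at reach.
    demo = [
        [True, True, True, True],
        [True, False, False, False],
        [False, False, False, False],
        [True, False, False, False],
    ]
    print(summarize(demo))
```

Under these assumptions, two policies with the same binary success rate can still be told apart by where their rollouts first break down, for example in an early interaction stage versus a late placement stage.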