🤖 AI Summary
This work addresses the lack of standardized, reproducible benchmarks for evaluating Universal Manipulation Interface (UMI)-style policies on real robots. To bridge this gap, we introduce UMI-Bench 1.0, to the best of our knowledge the first real-world evaluation platform designed specifically for UMI policies. It standardizes the entire pipeline, from data collection to deployment evaluation, through unified protocols for data acquisition, automated scene resetting, policy execution, and structured logging. Built on the UMI data paradigm, the platform couples wrist-mounted visual observations with canonical action representations, enabling open, auditable, and quantitative assessment of policy generalization and reliability in tabletop manipulation tasks.
📝 Abstract
Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.