🤖 AI Summary
Existing robot datasets lack synchronized force-torque sensing, hierarchical task annotations, and explicit failure logging, hindering research on natural-language-driven, long-horizon, contact-rich mobile manipulation. To address this, we introduce the AIRoA MoMa Dataset, a large-scale real-world multimodal robot dataset collected on the Human Support Robot (HSR) platform. It synchronously captures RGB video, joint states, six-axis wrist force-torque signals, and internal system states. Crucially, we propose a two-layer annotation schema pairing sub-goals with primitive actions, and we systematically log both successful and failed execution episodes. All data are standardized to the LeRobot v2.1 format. The initial release comprises 25,469 episodes (about 94 hours), filling critical gaps in contact-aware, hierarchically structured robot data and providing a benchmark for studying task generalization and robustness under physical interaction in Vision-Language-Action models.
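As a rough illustration of getting started, the dataset can be pulled from the Hugging Face Hub with the `huggingface_hub` library. The `snapshot_download` call below is a standard Hub API; the commented `LeRobotDataset` loading path is an assumption based on the stated LeRobot v2.1 format and may need adjustment to the installed `lerobot` version:

```python
# Minimal sketch: download the AIRoA MoMa dataset from the Hugging Face Hub.
# snapshot_download is part of the huggingface_hub library; the repo id is
# taken from the release page. Expect a large download (~94 hours of data).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="airoa-org/airoa-moma",
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")

# Assumption: since the data are standardized to LeRobot v2.1, the lerobot
# package's dataset class should be able to load episodes directly, e.g.:
#   from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
#   ds = LeRobotDataset("airoa-org/airoa-moma")
#   frame = ds[0]  # dict of synchronized camera, joint-state, and wrench data
# Exact import paths and feature keys depend on the installed lerobot version.
```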
📝 Abstract
As robots transition from controlled settings to unstructured human environments, building generalist agents that can reliably follow natural language instructions remains a central challenge. Progress in robust mobile manipulation requires large-scale multimodal datasets that capture contact-rich and long-horizon tasks, yet existing resources lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases. We address this gap with the AIRoA MoMa Dataset, a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states, together with a novel two-layer annotation schema of sub-goals and primitive actions for hierarchical learning and error analysis. The initial dataset comprises 25,469 episodes (approx. 94 hours) collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format. By uniquely integrating mobile manipulation, contact-rich interaction, and long-horizon structure, AIRoA MoMa provides a critical benchmark for advancing the next generation of Vision-Language-Action models. The first version of our dataset is now available at https://huggingface.co/datasets/airoa-org/airoa-moma.
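To make the two-layer annotation schema concrete, here is a purely illustrative sketch of how one episode's sub-goal and primitive-action labels might be organized. All field names and values below are hypothetical; the actual keys are defined by the dataset card and the LeRobot v2.1 release:

```python
# Hypothetical sketch of a two-layer (sub-goal / primitive-action) annotation
# for one episode. Field names are illustrative only; consult the dataset
# card at huggingface.co/datasets/airoa-org/airoa-moma for the real schema.
episode_annotation = {
    "instruction": "Put the mug from the table into the sink",  # task in natural language
    "success": False,  # failed episodes are logged explicitly alongside successes
    "subgoals": [
        {
            "description": "Grasp the mug on the table",
            "primitive_actions": [
                {"name": "approach", "start_frame": 0, "end_frame": 120},
                {"name": "grasp", "start_frame": 121, "end_frame": 180},
            ],
        },
        {
            "description": "Place the mug in the sink",
            "primitive_actions": [
                {"name": "move_base", "start_frame": 181, "end_frame": 400},
                {"name": "release", "start_frame": 401, "end_frame": 450},
            ],
        },
    ],
}
```

A structure like this supports both hierarchical policy learning (conditioning on sub-goals versus primitives) and error analysis, since failures can be localized to a specific sub-goal or primitive segment.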