🤖 AI Summary
Existing robot datasets lack synchronized force-torque sensing, hierarchical task annotations, and explicit failure logging, hindering research on natural-language-driven, long-horizon, contact-rich mobile manipulation. To address this, we introduce the AIRoA MoMa Dataset, a large-scale real-world multimodal robot dataset collected on the Human Support Robot (HSR) platform. It synchronously captures RGB video, joint states, six-axis wrist force-torque signals, and internal system states. Crucially, we propose a two-layer annotation schema pairing sub-goals with primitive actions, and we systematically log both successful and failed execution episodes. All data are standardized to the LeRobot v2.1 format. The initial release comprises 25,469 episodes (about 94 hours), filling critical gaps in contact-aware, hierarchically structured robot data and providing a benchmark for studying task generalization and robustness under physical interaction in Vision-Language-Action models.
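As a rough illustration of getting started, the dataset can be pulled from the Hugging Face Hub with the `huggingface_hub` library. The `snapshot_download` call below is a standard Hub API; the commented `LeRobotDataset` loading path is an assumption based on the stated LeRobot v2.1 format and may need adjustment to the installed `lerobot` version:

```python
# Minimal sketch: download the AIRoA MoMa dataset from the Hugging Face Hub.
# snapshot_download is part of the huggingface_hub library; the repo id is
# taken from the release page. Expect a large download (~94 hours of data).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="airoa-org/airoa-moma",
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")

# Assumption: since the data are standardized to LeRobot v2.1, the lerobot
# package's dataset class should be able to load episodes directly, e.g.:
#   from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
#   ds = LeRobotDataset("airoa-org/airoa-moma")
#   frame = ds[0]  # dict of synchronized camera, joint-state, and wrench data
# Exact import paths and feature keys depend on the installed lerobot version.
```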
📝 Abstract
As robots transition from controlled settings to unstructured human environments, building generalist agents that can reliably follow natural language instructions remains a central challenge. Progress in robust mobile manipulation requires large-scale multimodal datasets that capture contact-rich and long-horizon tasks, yet existing resources lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases. We address this gap with the AIRoA MoMa Dataset, a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states, together with a novel two-layer annotation schema of sub-goals and primitive actions for hierarchical learning and error analysis. The initial dataset comprises 25,469 episodes (approx. 94 hours) collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format. By uniquely integrating mobile manipulation, contact-rich interaction, and long-horizon structure, AIRoA MoMa provides a critical benchmark for advancing the next generation of Vision-Language-Action models. The first version of our dataset is now available at https://huggingface.co/datasets/airoa-org/airoa-moma.
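To make the two-layer annotation schema concrete, here is a purely illustrative sketch of how one episode's sub-goal and primitive-action labels might be organized. All field names and values below are hypothetical; the actual keys are defined by the dataset card and the LeRobot v2.1 release:

```python
# Hypothetical sketch of a two-layer (sub-goal / primitive-action) annotation
# for one episode. Field names are illustrative only; consult the dataset
# card at huggingface.co/datasets/airoa-org/airoa-moma for the real schema.
episode_annotation = {
    "instruction": "Put the mug from the table into the sink",  # task in natural language
    "success": False,  # failed episodes are logged explicitly alongside successes
    "subgoals": [
        {
            "description": "Grasp the mug on the table",
            "primitive_actions": [
                {"name": "approach", "start_frame": 0, "end_frame": 120},
                {"name": "grasp", "start_frame": 121, "end_frame": 180},
            ],
        },
        {
            "description": "Place the mug in the sink",
            "primitive_actions": [
                {"name": "move_base", "start_frame": 181, "end_frame": 400},
                {"name": "release", "start_frame": 401, "end_frame": 450},
            ],
        },
    ],
}
```

A structure like this supports both hierarchical policy learning (conditioning on sub-goals versus primitives) and error analysis, since failures can be localized to a specific sub-goal or primitive segment.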