M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

📅 2025-04-25
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses two key challenges in robotic multimodal temporal action segmentation (TAS): (1) difficulty in reusing learned features across tasks and models, and (2) degradation of visual feature reliability under low-visibility conditions. To this end, we propose M2R2, a transferable multimodal feature extractor—the first unified framework for robotic TAS that jointly models *intrinsic* (e.g., joint angles, torques) and *extrinsic* (e.g., RGB, depth) sensory modalities within an ontology-aware representation space. We introduce a novel modality-cooperative pretraining strategy that combines multimodal feature alignment with cross-modal contrastive learning to disentangle modality-specific dependencies and enable robust cross-model feature reuse. Evaluated on the REASSEMBLE benchmark, M2R2 achieves state-of-the-art performance, outperforming the previous best robotic TAS method by 46.6%. Ablation studies confirm the critical contribution of our multimodal cooperative pretraining to performance gains.
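The summary above names cross-modal contrastive learning between the intrinsic (proprioceptive) and extrinsic (visual) streams as the core of the pretraining. The paper's code is not reproduced here, so the following is only a minimal PyTorch sketch of that general idea: two hypothetical encoders embed time-aligned sensor windows, and a symmetric InfoNCE loss treats matching windows as positives. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of cross-modal contrastive pretraining between an intrinsic
# (proprioceptive) encoder and an extrinsic (visual) encoder.
# All names, dimensions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProprioEncoder(nn.Module):
    """Encodes a window of joint states (angles, torques) into one embedding."""
    def __init__(self, in_dim=14, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):               # x: (B, T, in_dim)
        return self.net(x).mean(dim=1)  # temporal average pooling -> (B, embed_dim)


class VisualEncoder(nn.Module):
    """Projects a window of per-frame visual features into the shared space."""
    def __init__(self, in_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):                # x: (B, T, in_dim), e.g. frozen CNN features
        return self.proj(x).mean(dim=1)  # (B, embed_dim)


def cross_modal_infonce(z_prop, z_vis, temperature=0.07):
    """Symmetric InfoNCE: time-aligned proprioceptive/visual windows are positives."""
    z_prop = F.normalize(z_prop, dim=-1)
    z_vis = F.normalize(z_vis, dim=-1)
    logits = z_prop @ z_vis.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(z_prop.size(0), device=z_prop.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    prop_enc, vis_enc = ProprioEncoder(), VisualEncoder()
    opt = torch.optim.Adam(
        list(prop_enc.parameters()) + list(vis_enc.parameters()), lr=1e-4)

    # Dummy batch: 32 time-aligned windows of 16 timesteps each.
    prop = torch.randn(32, 16, 14)   # joint angles + torques
    vis = torch.randn(32, 16, 512)   # per-frame visual features

    opt.zero_grad()
    loss = cross_modal_infonce(prop_enc(prop), vis_enc(vis))
    loss.backward()
    opt.step()
    print(f"contrastive loss: {loss.item():.4f}")
```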

📝 Abstract
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
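The abstract's second claim is that features learned by the extractor can be reused across multiple TAS models. As a rough illustration of what such reuse could look like (again with assumed class names and dimensions, not the paper's implementation), a frozen multimodal extractor emits per-timestep features once, and any downstream segmentation head, here a toy dilated temporal-convolution stage in the spirit of MS-TCN, consumes them:

```python
# Hypothetical sketch of feature reuse: a frozen multimodal extractor produces
# per-timestep features, and an interchangeable downstream TAS head consumes them.
# The classes and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn


class FrozenMultimodalExtractor(nn.Module):
    """Stand-in for a pretrained extractor; weights are frozen after pretraining."""
    def __init__(self, prop_dim=14, vis_dim=512, embed_dim=256):
        super().__init__()
        self.prop_proj = nn.Linear(prop_dim, embed_dim)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        for p in self.parameters():
            p.requires_grad_(False)  # features are reused as-is across TAS models

    @torch.no_grad()
    def forward(self, prop, vis):    # (B, T, prop_dim), (B, T, vis_dim)
        return torch.cat([self.prop_proj(prop),
                          self.vis_proj(vis)], dim=-1)  # (B, T, 2*embed_dim)


class TemporalConvHead(nn.Module):
    """A simple dilated temporal-convolution segmentation head (MS-TCN-style stage)."""
    def __init__(self, in_dim=512, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, 64, kernel_size=3, padding=2, dilation=2)
        self.out = nn.Conv1d(64, n_classes, kernel_size=1)

    def forward(self, feats):                        # feats: (B, T, in_dim)
        x = feats.transpose(1, 2)                    # -> (B, in_dim, T)
        return self.out(torch.relu(self.conv1(x)))   # per-frame logits (B, n_classes, T)


if __name__ == "__main__":
    extractor = FrozenMultimodalExtractor()
    head = TemporalConvHead(in_dim=512, n_classes=10)  # only the head is trained

    prop = torch.randn(2, 100, 14)    # proprioception: 100 timesteps
    vis = torch.randn(2, 100, 512)    # exteroception: matching visual features
    logits = head(extractor(prop, vis))
    print(logits.shape)               # torch.Size([2, 10, 100])
```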
Problem

Research questions and friction points this paper is trying to address.

Integrating multimodal sensors for robotic action segmentation
Enabling feature reuse across different segmentation models
Improving performance in limited object visibility scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal feature extractor for TAS
Novel pretraining strategy for feature reuse
Combines proprioceptive and exteroceptive sensors