EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation

๐Ÿ“… 2025-05-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Addressing the significant domain gap between simulation-based training and real-robot deployment, as well as the limited capability of existing models to effectively fuse RGB, depth, and point cloud modalities, this work introduces DROID-3Dโ€”the first high-quality, embodied manipulation-oriented 3D multimodal datasetโ€”and EmbodiedMAE, a novel multimodal masked autoencoder supporting joint masked reconstruction of RGB, depth maps, and point clouds. Its key innovations include: (i) the first implementation of cross-modal feature alignment and unified 3D representation learning; and (ii) geometric consistency modeling between point clouds and images to enhance 3D perception. Evaluated on 70 simulated and 20 real-robot manipulation tasks, EmbodiedMAE consistently outperforms state-of-the-art vision foundation models: it reduces policy transfer sample requirements by 60%, improves training efficiency by 40%, and accelerates convergence by 2.3ร—.

Technology Category

Application Category

๐Ÿ“ Abstract
We present EmbodiedMAE, a unified 3D multi-modal representation for robot manipulation. Current approaches suffer from significant domain gaps between training datasets and robot manipulation tasks, while also lacking model architectures that can effectively incorporate 3D information. To overcome these limitations, we enhance the DROID dataset with high-quality depth maps and point clouds, constructing DROID-3D as a valuable supplement for 3D embodied vision research. Then we develop EmbodiedMAE, a multi-modal masked autoencoder that simultaneously learns representations across RGB, depth, and point cloud modalities through stochastic masking and cross-modal fusion. Trained on DROID-3D, EmbodiedMAE consistently outperforms state-of-the-art vision foundation models (VFMs) in both training efficiency and final performance across 70 simulation tasks and 20 real-world robot manipulation tasks on two robot platforms. The model exhibits strong scaling behavior with size and promotes effective policy learning from 3D inputs. Experimental results establish EmbodiedMAE as a reliable unified 3D multi-modal VFM for embodied AI systems, particularly in precise tabletop manipulation settings where spatial perception is critical.
Problem

Research questions and friction points this paper is trying to address.

Bridges domain gaps between training datasets and robot manipulation tasks
Lacks effective 3D information integration in model architectures
Improves 3D perception for precise tabletop manipulation in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances DROID dataset with 3D depth and point clouds
Develops multi-modal masked autoencoder for RGB, depth, point clouds
Outperforms state-of-the-art models in 90 robot tasks
๐Ÿ”Ž Similar Papers
No similar papers found.