🤖 AI Summary
This work addresses the challenge of perception and decision-making in multimodal robot learning under arbitrary sensor modality missingness. The authors propose a novel framework that integrates conditional variational autoencoders (CVAEs) with Transformer-based attention mechanisms to reconstruct missing modalities and learn a unified, fixed-dimensional, and robust multimodal representation. Unlike prior approaches, this method handles missing modalities seamlessly during both training and inference, making it the first end-to-end framework fully compatible with arbitrary missingness patterns. Evaluated on five standard multimodal datasets, the model significantly outperforms existing fusion methods and demonstrates superior performance in human trajectory prediction and robotic manipulation tasks.
📝 Abstract
Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.