🤖 AI Summary
This work addresses the challenge of fusing heterogeneous tactile signals with different temporal resolutions in contact-rich robotic manipulation by introducing the Multi-Resolution Tactile Sensing (MiTaS) framework. MiTaS combines an RGB camera stream, a vision-based GelSight Mini tactile sensor, and a high-frequency event-based Evetac tactile sensor. Modality-specific convolutional stems extract features from each stream, a transformer fuses them across modalities, and the fused representation conditions a flow-matching policy trained by imitation learning. Across five contact-rich tasks, MiTaS reaches an average success rate of 80%, compared with 31% for a vision-only baseline and 54% for a visual-tactile baseline. Co-training a visuo-tactile model with multi-tactile data also improves performance by over 10% in certain tasks, even though the Evetac sensor is not available during policy evaluation.
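To make the idea of differing temporal resolutions concrete, the following minimal sketch groups a high-rate event stream into windows aligned with lower-rate frames. It is an illustration only, not the MiTaS preprocessing: the function name `events_per_frame`, the 25 Hz and 1 kHz rates, and the 40 ms window are assumptions chosen for the example and are not taken from the paper.

```python
from bisect import bisect_right

def events_per_frame(frame_times_us, event_times_us, window_us):
    """For each frame timestamp t, return the events with timestamps in (t - window_us, t].

    Both inputs are sorted lists of integer timestamps in microseconds.
    """
    windows = []
    for t in frame_times_us:
        lo = bisect_right(event_times_us, t - window_us)  # first event strictly after t - window
        hi = bisect_right(event_times_us, t)              # one past the last event at or before t
        windows.append(event_times_us[lo:hi])
    return windows

# Hypothetical rates (assumptions, not the sensors' actual specifications):
# frames every 40 ms (25 Hz), events every 1 ms (1 kHz).
frame_times_us = [40_000, 80_000, 120_000]
event_times_us = list(range(1_000, 120_001, 1_000))

windows = events_per_frame(frame_times_us, event_times_us, window_us=40_000)
counts = [len(w) for w in windows]  # number of high-rate samples paired with each frame
```

Integer microsecond timestamps avoid floating-point boundary effects when an event falls exactly on a window edge. Each resulting window could then be passed to a modality-specific encoder alongside the frame it is paired with.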
📝 Abstract
Touch sensing is beneficial for solving a wide variety of manipulation tasks. While tactile sensors with many different properties exist, fusing multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions to solve complex contact-rich manipulation tasks. We propose a novel architecture that combines modality-specific convolutional stems with transformer-based fusion to integrate information from an RGB camera stream, a vision-based GelSight Mini sensor, and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80%, while vision-only (31%) and visual-tactile (54%) baselines cannot solve the tasks reliably. Co-training a visuo-tactile model with multi-tactile data boosts performance by over 10% in certain tasks, without access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.
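For readers unfamiliar with flow matching, the sketch below shows the generic recipe with a straight-line probability path: a regression target for training and Euler integration for sampling an action. It is not the MiTaS implementation. All names (`interpolate`, `target_velocity`, `euler_sample`, `oracle_velocity`) are hypothetical, and the oracle velocity field stands in for a trained network; in a real policy the conditioning input would be the fused multi-sensor representation rather than the target itself.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network along the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, v)) / len(v)

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x, t, cond):
    """Stand-in for a trained network (assumption): the exact velocity field
    when the target distribution is a single known action cond["target"]."""
    return [(g - xi) / (1 - t) for g, xi in zip(cond["target"], x)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]   # sample from the base distribution
target = [0.5, -0.2, 1.0]                            # hypothetical expert action
action = euler_sample(oracle_velocity, noise, {"target": target}, steps=10)
```

With the oracle field, the integration ends at the target action up to floating-point error. A learned network trained with `fm_loss` would only approximate this field, so the number of integration steps then trades accuracy against inference time.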