AI Summary
This work addresses the lack of research on alignment mechanisms between visual and tactile signals across local-to-global scales, as well as the absence of high-quality datasets and evaluation benchmarks in tactile learning. To bridge this gap, the authors introduce a large-scale tactile dataset comprising 505 real-world objects and over 20,000 physical interactions, along with the first vision-tactile holistic matching benchmark. They further propose Vision-Tactile Patch Alignment (VTPA), a novel method that leverages patch-level image alignment to model the localized nature of tactile contact. Experimental results demonstrate that VTPA significantly outperforms baseline approaches employing no alignment or whole-image alignment, thereby validating the efficacy of local alignment strategies for cross-modal correspondence.
Abstract
Touch is the primary medium through which humans interact with the environment. Currently, tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, while research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are lacking. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. We then propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these methods outperform both methods without alignment and methods that align tactile signals with whole-object images.
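To make the contrast between patch-level and whole-image alignment concrete, the following is a minimal toy sketch, not the authors' VTPA implementation. It assumes that an image is represented as a list of patch embeddings and a tactile contact as a single embedding in the same space. The helper names (`cosine`, `mean_pool`, `best_patch`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(patches):
    """Average patch embeddings into one whole-image embedding."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]

def best_patch(tactile, patches):
    """Return (index, similarity) of the patch closest to the tactile embedding."""
    sims = [cosine(tactile, p) for p in patches]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]

# Toy data (made up): four patch embeddings from one object image.
# The tactile reading resembles patch 2 only, mimicking a local contact.
patches = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
tactile = [0.1, 0.0, 0.9]

# Patch-level matching: compare the contact against each patch separately.
patch_idx, patch_sim = best_patch(tactile, patches)

# Whole-image matching: compare the contact against the pooled image embedding.
global_sim = cosine(tactile, mean_pool(patches))

print(f"best patch: {patch_idx}, patch similarity: {patch_sim:.3f}")
print(f"whole-image similarity: {global_sim:.3f}")
```

In this toy setup the contact matches one patch strongly, while pooling over all patches dilutes the signal. This is the intuition behind aligning tactile signals with local image regions rather than with the whole object image.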