Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

📅 2026-06-10

🤖 AI Summary

This work addresses two gaps in tactile learning: limited research on aligning visual and tactile signals across local-to-global scales, and the absence of high-quality datasets and evaluation benchmarks. To bridge these gaps, the authors introduce a large-scale tactile dataset comprising 505 real-world objects and over 20,000 tactile contacts, along with a Vis-Tac Holographic Matching Benchmark for evaluating local-to-global vision-tactile alignment. They further propose Vision-Tactile Patch Alignment (VTPA), which aligns tactile signals with image patches rather than whole images to reflect the localized nature of tactile contact. Experiments show that VTPA outperforms baselines that use no alignment or whole-image alignment, supporting local alignment as a strategy for cross-modal correspondence.

📝 Abstract

Touch is the primary medium through which humans interact with the environment. Current tactile learning mainly focuses on image-level pretraining or alignment. However, tactile signals correspond to local object contact, research into scale alignment and holographic matching remains limited, and suitable datasets and benchmarks are lacking. To bridge this gap, we first construct a data collection system to acquire a large-scale tactile dataset, with over 20K tactile contacts from 505 real-world objects. Building on this dataset, we design a Vis-Tac Holographic Matching Benchmark to evaluate vision-tactile local-to-global alignment ability. We then propose Vision-Tactile Patch Alignment (VTPA) methods for vision-tactile representation learning. Experiments demonstrate that these methods outperform both methods without alignment and methods that align with whole-object images.
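
The difference between whole-image and patch-level alignment can be illustrated with a minimal sketch. The abstract does not specify the VTPA loss, so everything below is an assumption made for illustration: the function names (`patch_alignment_scores`, `info_nce`), the max-over-patches pooling, the InfoNCE objective, and the temperature value are hypothetical choices, not the authors' method. The idea shown is only that a tactile embedding is scored against each patch of an image, so a local contact can match a local region.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_alignment_scores(tactile, patches):
    """Score every tactile sample against every image via its best patch.

    tactile: (B, D) tactile embeddings.
    patches: (B, P, D) patch embeddings, P patches per image.
    Returns a (B, B) matrix whose entry [i, j] is the highest cosine
    similarity between tactile sample i and any patch of image j.
    Max pooling over patches is an assumed choice, not taken from the paper.
    """
    t = l2_normalize(tactile)
    p = l2_normalize(patches)
    sims = np.einsum("id,jpd->ijp", t, p)  # (B, B, P)
    return sims.max(axis=-1)

def info_nce(scores, temperature=0.07):
    # Cross-entropy over rows, treating the diagonal as the positive pair.
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Synthetic data: each tactile embedding is a noisy copy of one patch
# (index 3) of its paired image, mimicking a local contact.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 16, 32))
tactile = patches[:, 3, :] + 0.01 * rng.normal(size=(4, 32))

scores = patch_alignment_scores(tactile, patches)
loss = info_nce(scores)
```

In this toy setup each tactile sample scores highest against its own image because one of that image's patches nearly matches it. A whole-image variant would instead compare the tactile embedding with a single pooled image vector, which dilutes the matching patch among the others.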

Problem

Research questions and friction points this paper is trying to address.

tactile learning

vision-tactile alignment

patch alignment

holographic matching

tactile dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Tactile Alignment

Patch-level Representation

Tactile Dataset

Holographic Matching Benchmark

Multimodal Learning
