Surformer v2: A Multimodal Classifier for Surface Understanding from Touch and Vision

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses material recognition for surfaces that robots touch and manipulate, proposing a vision–tactile multimodal fusion framework to enhance material perception during manipulation and interaction. Methodologically, it employs an Efficient V-Net for visual feature extraction and an encoder-only Transformer to model tactile time-series sequences, marking the first application of a pure Transformer architecture to tactile sequence modeling. A lightweight, learnable decision-level logits weighting mechanism is introduced for end-to-end joint optimization. Key contributions include: (i) improved temporal representation capability via Transformer-based tactile modeling; and (ii) a context-aware, adaptive fusion strategy that balances flexibility and computational efficiency. Evaluated on the Touch and Go dataset, the proposed method achieves state-of-the-art classification accuracy while maintaining low inference latency, satisfying real-time perception requirements for robotic systems.

📝 Abstract
Multimodal surface material classification plays a critical role in advancing tactile perception for robotic manipulation and interaction. In this paper, we present Surformer v2, an enhanced multimodal classification architecture designed to integrate visual and tactile sensory streams through a late (decision-level) fusion mechanism. Building on our earlier Surformer v1 framework [1], which employed handcrafted feature extraction followed by a mid-level fusion architecture with multi-head cross-attention layers, Surformer v2 integrates the feature extraction process within the model itself and shifts to late fusion. The vision branch leverages a CNN-based classifier (Efficient V-Net), while the tactile branch employs an encoder-only Transformer, allowing each modality to extract modality-specific features optimized for classification. Rather than merging feature maps, the model performs decision-level fusion by combining the output logits through a learnable weighted sum, enabling adaptive emphasis on each modality depending on data context and training dynamics. We evaluate Surformer v2 on the Touch and Go dataset [2], a multimodal benchmark comprising surface images and corresponding tactile sensor readings. Our results demonstrate that Surformer v2 performs well while maintaining competitive inference speed, making it suitable for real-time robotic applications. These findings underscore the effectiveness of decision-level fusion and Transformer-based tactile modeling for enhancing surface understanding in multimodal robotic perception.
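The tactile branch described above treats the sensor readings as a time series processed by an encoder-only Transformer. A minimal NumPy sketch of the core idea, single-head self-attention over tactile timesteps followed by mean pooling and a linear classification head, is shown below; the dimensions, random weights, and single-layer structure are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention: each timestep attends to all others
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

# Hypothetical tactile sequence: T timesteps of d-dimensional sensor readings
T, d, n_classes = 50, 8, 5
X = rng.normal(size=(T, d))

# Randomly initialized projection and output weights (illustrative only)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wout = rng.normal(size=(d, n_classes))

H = self_attention(X, Wq, Wk, Wv)  # mix information across timesteps
pooled = H.mean(axis=0)            # pool the sequence into one vector
logits = pooled @ Wout             # per-class scores for this sequence
```

In a full encoder-only Transformer this attention step would be stacked with feed-forward layers, residual connections, and positional encodings; the sketch only shows why attention suits temporal tactile data, since every timestep can weight every other timestep when forming its representation.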
Problem

Research questions and friction points this paper is trying to address.

Develop a multimodal classifier integrating vision and touch
Enhance surface material classification for robotic perception
Implement decision-level fusion for adaptive modality emphasis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Late fusion mechanism for vision and tactile integration
Transformer-based tactile encoder for feature extraction
Learnable weighted sum for adaptive modality emphasis
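The learnable weighted sum over modality logits can be sketched in a few lines. In the snippet below, `alpha` stands in for the learnable parameter and is squashed through a sigmoid so the two modality weights stay positive and sum to one; this gating form and the function names are assumptions for illustration, not the paper's exact mechanism:

```python
import numpy as np

def fuse_logits(vision_logits, tactile_logits, alpha):
    """Decision-level fusion: convex combination of per-modality logits.

    alpha is a scalar that would be learned end-to-end in training;
    here it is a plain float for illustration. A sigmoid keeps the
    vision weight w in (0, 1), so tactile gets weight 1 - w.
    """
    w = 1.0 / (1.0 + np.exp(-alpha))
    return w * vision_logits + (1.0 - w) * tactile_logits

# Hypothetical per-class logits from each branch for one sample
vision_logits = np.array([2.0, 0.5, 0.1])
tactile_logits = np.array([0.2, 1.5, 0.3])

# alpha = 0.0 gives equal weight (w = 0.5) to both modalities
fused = fuse_logits(vision_logits, tactile_logits, 0.0)  # → [1.1, 1.0, 0.2]
```

Because fusion happens on logits rather than feature maps, each branch can be trained with its own architecture, and the single extra parameter adds negligible inference cost, which is consistent with the real-time emphasis above.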