🤖 AI Summary
This paper addresses two key bottlenecks in touch-language multimodal commonsense reasoning for open physical scenarios: (1) modality discrepancy, in which tactile signals are treated as a mere sub-modality of language, and (2) the scarcity of open-ended tactile data. To this end, the authors propose SToLa, a Self-Adaptive Touch-Language framework. Methodologically, SToLa introduces a Mixture-of-Experts (MoE) dynamic routing mechanism that processes, unifies, and manages the tactile and language modalities while capturing their distinct characteristics, and integrates cross-modal alignment representation learning, tactile-language joint pretraining, and physics-aware prompt engineering. The authors also construct a tactile commonsense reasoning dataset and benchmark with free-form questions and responses covering eight physical properties, four interactive characteristics, and diverse commonsense knowledge. Experiments show that SToLa achieves competitive performance against existing models on both the PhysiCLeAR benchmark and the self-constructed dataset, supporting the effectiveness of the MoE architecture for multimodal management and for open-scenario tactile commonsense reasoning.
📝 Abstract
This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing large touch-language models often treat touch as a mere sub-modality of language, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce SToLa, a Self-Adaptive Touch-Language framework. SToLa utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show that SToLa achieves competitive performance compared to existing models on the PhysiCLeAR benchmark and self-constructed datasets, demonstrating the effectiveness of the Mixture of Experts architecture for multimodal management and its performance advantages on open-scenario tactile commonsense reasoning tasks.
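The dynamic MoE routing over tactile and language modalities described above can be illustrated with a minimal token-level sketch. This is not SToLa's actual implementation; the expert count, top-k routing, dimensions, and module names below are hypothetical assumptions for illustration only.

```python
# Minimal sketch of token-level Mixture-of-Experts routing over a sequence that
# interleaves tactile and language token embeddings. Illustrative only: expert
# count, top-k, and dimensions are assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoERouter(nn.Module):
    def __init__(self, d_model: int = 256, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network; in principle some experts
        # can specialize on tactile tokens and others on language tokens.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        # The gate scores every token and selects its top-k experts dynamically.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model), tactile and language tokens mixed.
        gate_logits = self.gate(tokens)                       # (B, T, E)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                  # renormalize over top-k
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (indices[..., slot] == e)              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out


# Usage: jointly route a batch of tactile and language token embeddings.
if __name__ == "__main__":
    tactile = torch.randn(2, 16, 256)   # e.g. encoded tactile patches (hypothetical)
    language = torch.randn(2, 16, 256)  # e.g. text token embeddings
    fused = MoERouter()(torch.cat([tactile, language], dim=1))
    print(fused.shape)                  # torch.Size([2, 32, 256])
```

Because the gate scores each token individually, routing adapts per token rather than per modality; a full framework would typically also add load-balancing losses and modality-aware conditioning, which are omitted here for brevity.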