🤖 AI Summary
This study addresses the challenge of accurately predicting haptic texture perception attributes to enhance realism in VR/AR interaction and robotic understanding of physical surfaces. To overcome the poor generalizability of existing unimodal approaches, we first construct a psychophysically calibrated four-dimensional haptic perceptual space spanning the rough–smooth, flat–bumpy, sticky–slippery, and hard–soft dimensions, and then propose a vision–haptics bimodal joint mapping framework. Specifically, a CNN-based autoencoder extracts visual texture features while a ConvLSTM models temporal haptic signals, and multi-feature fusion enables cross-modal regression of perceptual scores. Under leave-one-out cross-validation, our method achieves significantly lower MAE and RMSE than unimodal baselines and demonstrates strong generalization to unseen textures. This work establishes a novel, interpretable, and transferable paradigm for haptic perception modeling, advancing embodied intelligence and immersive human–machine interaction.
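For concreteness, the sketch below shows one way such a bimodal mapping could be wired up in PyTorch: a small convolutional encoder standing in for the encoder half of the CNN-based autoencoder, a 1D ConvLSTM cell unrolled over windowed tactile frames, and a fusion head regressing the four attribute scores. Every module name, layer size, and the assumption that tactile input arrives as a sequence of short 1D frames is illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class ConvLSTM1dCell(nn.Module):
    """Minimal 1D ConvLSTM cell: convolutional input/forget/output/candidate gates."""
    def __init__(self, in_ch, hid_ch, k=5):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):              # x: (B, in_ch, L)
        h, c = state                          # each (B, hid_ch, L)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class BimodalAttributeRegressor(nn.Module):
    """Fuses a visual embedding and a tactile embedding to predict four attribute scores."""
    def __init__(self, hid_ch=16, emb=64):
        super().__init__()
        # Visual branch: encoder half of a CNN autoencoder (decoder omitted in this sketch).
        self.vis_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb),
        )
        # Tactile branch: ConvLSTM over a sequence of short 1D signal frames.
        self.cell = ConvLSTM1dCell(in_ch=1, hid_ch=hid_ch)
        self.tac_proj = nn.Linear(hid_ch, emb)
        # Fusion head: concatenated embeddings -> 4 perceptual ratings.
        self.head = nn.Sequential(nn.Linear(2 * emb, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, image, tactile):        # image: (B,3,H,W); tactile: (B,T,1,L)
        B, T, _, L = tactile.shape
        h = tactile.new_zeros(B, self.cell.hid_ch, L)
        c = tactile.new_zeros(B, self.cell.hid_ch, L)
        for t in range(T):                    # unroll the ConvLSTM over tactile frames
            h, c = self.cell(tactile[:, t], (h, c))
        tac = self.tac_proj(h.mean(dim=-1))   # pool over signal length
        vis = self.vis_enc(image)
        return self.head(torch.cat([vis, tac], dim=1))

model = BimodalAttributeRegressor()
scores = model(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 1, 128))
print(scores.shape)  # torch.Size([2, 4]) -> four attribute ratings per sample
```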
📝 Abstract
Accurate prediction of the perceptual attributes of haptic textures is essential for advancing VR and AR applications and for enhancing robotic interaction with physical surfaces. This paper presents a deep learning-based multi-modal framework that combines visual and tactile data to predict perceptual texture ratings from multi-feature inputs. A four-dimensional haptic attribute space spanning the rough–smooth, flat–bumpy, sticky–slippery, and hard–soft dimensions is first constructed through psychophysical experiments in which participants rate 50 diverse real-world texture samples. A physical signal space is then created by collecting visual and tactile data from the same textures. Finally, a deep learning architecture integrating a CNN-based autoencoder for visual feature learning and a ConvLSTM network for tactile signal processing is trained to predict the user-assigned attribute ratings. This multi-modal, multi-feature approach maps physical signals to perceptual ratings, enabling accurate predictions for unseen textures. Predictive accuracy is assessed with leave-one-out cross-validation, which rigorously tests the model's reliability and generalizability against several machine learning and deep learning baselines. Experimental results show that the framework consistently outperforms single-modality approaches, achieving lower MAE and RMSE and demonstrating the benefit of combining visual and tactile modalities.
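As a point of reference, leave-one-out cross-validation over the 50 textures trains on 49 samples and predicts the held-out one, repeating the process for every texture before computing MAE and RMSE on the pooled predictions. The snippet below is a minimal sketch of that evaluation loop; the `train_fn`/`predict_fn` interface and the placeholder arrays are hypothetical and not part of the paper's codebase.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_errors(features, ratings, train_fn, predict_fn):
    """Leave-one-texture-out evaluation: train on N-1 textures, predict the held-out one.

    features: (N, ...) array of per-texture inputs; ratings: (N, 4) attribute scores.
    train_fn(X, y) -> model; predict_fn(model, X) -> (len(X), 4) predictions.
    """
    preds = np.zeros_like(ratings, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(features):
        model = train_fn(features[train_idx], ratings[train_idx])
        preds[test_idx] = predict_fn(model, features[test_idx])
    err = preds - ratings
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())  # MAE, RMSE

# Example with a trivial baseline that predicts the training-set mean rating.
X = np.random.rand(50, 8)              # placeholder per-texture features
y = np.random.rand(50, 4)              # placeholder attribute ratings
mae, rmse = loocv_errors(X, y,
                         train_fn=lambda X, y: y.mean(axis=0),
                         predict_fn=lambda m, X: np.tile(m, (len(X), 1)))
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}")
```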