🤖 AI Summary
To address the challenges of modeling temporal proprioceptive data and poor generalization in foot-ground contact estimation for quadrupedal robots, this paper proposes a cross-modal temporal-spatial image encoding method. Multi-source signals—including joint angles, IMU measurements, and foot-end velocities—are mapped onto a 2D structured image according to the robot's kinematic topology, explicitly preserving morphological connectivity and temporal dependencies. A lightweight convolutional neural network then performs end-to-end contact state prediction. Compared with conventional sequence-based approaches (e.g., RNN/LSTM models), the method achieves substantial improvements in both simulation and real-world experiments: contact accuracy increases from 87.7% to 94.5%, while requiring only 1/15 of the temporal window length, which improves real-time performance and cross-terrain generalization. The core contribution is a physically interpretable, structured image representation that bridges proprioceptive sensing with the spatiotemporal modeling strengths of CNNs.
📝 Abstract
This paper presents a novel approach for representing proprioceptive time-series data from quadruped robots as structured two-dimensional images, enabling the use of convolutional neural networks for learning locomotion-related tasks. The proposed method encodes temporal dynamics from multiple proprioceptive signals, such as joint positions, IMU readings, and foot velocities, while preserving the robot's morphological structure in the spatial arrangement of the image. This transformation captures inter-signal correlations and gait-dependent patterns, providing a richer feature space than direct time-series processing. We apply this concept to the problem of contact estimation, a key capability for stable and adaptive locomotion on diverse terrains. Experimental evaluations on both real-world datasets and simulated environments show that our image-based representation consistently improves prediction accuracy and generalization over conventional sequence-based models, underscoring the potential of cross-modal encoding strategies for robotic state learning. Our method achieves superior performance on the contact dataset, improving contact state accuracy from 87.7% to 94.5% over the recently proposed MI-HGNN method while using a window 15 times shorter.
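The encoding idea described above can be illustrated with a minimal sketch. The paper does not specify the exact image layout, so the following is a hypothetical arrangement, assuming a 12-DoF quadruped: for each leg, rows for its three joint positions and three foot-velocity components are stacked together (preserving kinematic grouping), a shared IMU block is appended, and time runs along the horizontal axis. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def encode_proprio_image(joint_pos, imu, foot_vel):
    """Map proprioceptive time series onto a 2D image (rows = signals
    grouped by leg, columns = time steps).

    Hypothetical input shapes for a 12-DoF quadruped over T steps:
      joint_pos : (T, 12)  three joint angles per leg, legs ordered FL, FR, RL, RR
      imu       : (T, 6)   e.g. angular velocity + linear acceleration
      foot_vel  : (T, 12)  three foot-end velocity components per leg
    """
    rows = []
    for leg in range(4):
        # Keep each leg's signals adjacent so convolutions see the
        # morphological grouping (joints and foot velocity of one leg).
        rows.append(joint_pos[:, 3 * leg:3 * leg + 3].T)  # (3, T)
        rows.append(foot_vel[:, 3 * leg:3 * leg + 3].T)   # (3, T)
    rows.append(imu.T)  # shared body-level block, (6, T)
    img = np.vstack(rows)  # (30, T)

    # Per-row min-max normalization so heterogeneous units share one scale.
    lo = img.min(axis=1, keepdims=True)
    hi = img.max(axis=1, keepdims=True)
    return (img - lo) / np.maximum(hi - lo, 1e-8)
```

The resulting `(30, T)` array can be fed to a small 2D CNN as a single-channel image; with the reported 15× shorter window, T stays small, keeping inference lightweight.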