Grasping Deformable Objects via Reinforcement Learning with Cross-Modal Attention to Visuo-Tactile Inputs

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of simultaneously preventing slippage and rupture when grasping deformable soft-shell objects, this paper proposes a multimodal deep reinforcement learning method that combines vision and tactile sensing. It introduces a cross-modal attention mechanism, pretrained via self-supervision, that semantically aligns and fuses RGB images (encoding global shape and pose) and tactile array data (capturing local contact pressure) directly at the representation level, overcoming the feature misalignment inherent in conventional early- and late-fusion schemes. The fused representations jointly estimate dynamic center-of-mass trajectories and real-time contact states, enabling robust grasp policy learning. Experiments under unknown motion disturbances and on previously unseen objects demonstrate significant improvements in grasp success rate and generalization. The results indicate that cross-modal attention effectively models dynamic deformation during manipulation, suggesting a new paradigm for multimodal control of highly compliant objects.

📝 Abstract
We consider the problem of grasping deformable objects with soft shells using a robotic gripper. Such objects have a center of mass that changes dynamically and are fragile, so they are prone to bursting. Thus, it is difficult for robots to generate control inputs that avoid dropping or breaking the object while performing manipulation tasks. Multimodal sensing data can help in understanding the grasping state through global information (e.g., shape, pose) from visual data and local information around the contact (e.g., pressure) from tactile data. Although the two modalities carry complementary information that is beneficial to use together, fusing them is difficult owing to their different properties. We propose a method based on deep reinforcement learning (DRL) that generates control inputs for a simple gripper from visuo-tactile sensing information. Our method employs a cross-modal attention module in the encoder network and trains it in a self-supervised manner using the loss function of the RL agent. With this multimodal fusion, the proposed method learns a representation for the DRL agent from the visuo-tactile sensory data. Experimental results show that cross-modal attention enables the method to outperform early- and late-fusion baselines across different environments, including unseen robot motions and objects.
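The paper does not publish its encoder architecture here, but the described fusion can be sketched as follows: tactile tokens query visual tokens through cross-attention, and the fused embedding serves as the state representation for the RL agent. This is a minimal illustrative sketch, not the authors' implementation; all dimensions, names, and the mean-pooling step are assumptions.

```python
# Hypothetical sketch of a cross-modal attention fusion module (not the
# authors' code): tactile tokens attend to visual tokens, and the fused
# embedding would feed a downstream DRL policy.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, vis_dim=128, tac_dim=32, embed_dim=64, num_heads=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.tac_proj = nn.Linear(tac_dim, embed_dim)
        # Cross-attention: tactile tokens query the visual tokens.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, vis_tokens, tac_tokens):
        # vis_tokens: (B, Nv, vis_dim), e.g. a flattened CNN feature map of the RGB image
        # tac_tokens: (B, Nt, tac_dim), e.g. one token per tactile-array cell
        q = self.tac_proj(tac_tokens)
        kv = self.vis_proj(vis_tokens)
        fused, _ = self.attn(query=q, key=kv, value=kv)
        fused = self.norm(fused + q)   # residual connection
        return fused.mean(dim=1)       # (B, embed_dim) state vector for the agent

# Usage with dummy tensors
fusion = CrossModalAttentionFusion()
vis = torch.randn(2, 49, 128)   # e.g. a 7x7 visual feature map, batch of 2
tac = torch.randn(2, 16, 32)    # e.g. a 4x4 tactile pressure array
state = fusion(vis, tac)
print(state.shape)  # torch.Size([2, 64])
```

In the paper this encoder is trained end-to-end with the RL agent's loss, which is what makes the fusion self-supervised: no separate alignment labels are needed.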
Problem

Research questions and friction points this paper is trying to address.

Grasping deformable objects without dropping or breaking them
Fusing visuo-tactile data for better grasping state understanding
Generating robotic control inputs using cross-modal attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses cross-modal attention for visuo-tactile fusion
Self-supervised deep reinforcement learning approach
Effective across unseen motions and objects
Yonghyun Lee
Dept. of Electronic Engineering at Sogang University, Seoul, Korea
Sungeun Hong
Associate Professor, Sungkyunkwan University
Multimodal Learning · RGB-X Learning · Domain Adaptation · Parameter Efficient ML
Min-gu Kim
College of Medicine, Yonsei University, Seoul, Korea
Gyeonghwan Kim
Dept. of Electronic Engineering at Sogang University, Seoul, Korea
Changjoo Nam
Associate Professor, Sogang University
Multi-Robot Systems · Task and Motion Planning · Manipulation