🤖 AI Summary
This work addresses the limitations of conventional tactile sensors in contact-intensive manipulation—namely, their narrow sensing range, low reliability, and high cost—by introducing LVTG, a low-cost visuo-tactile gripper. The design incorporates a wide opening angle, a highly wear-resistant skin, and a modular architecture, and for the first time integrates a CLIP-inspired contrastive learning strategy to align visual and tactile embeddings across modalities. When combined with the Action Chunking Transformer (ACT) policy network, LVTG significantly outperforms the original ACT approach on tasks such as grasping large, heavy objects, achieving higher task success rates and improved data efficiency. These results validate the effectiveness of the proposed hardware-algorithm co-design paradigm in contact-rich scenarios.
📝 Abstract
Robotic manipulation in contact-rich environments remains challenging, particularly when relying on conventional tactile sensors that suffer from limited sensing range, low reliability, and poor cost-effectiveness. In this work, we present LVTG, a low-cost visuo-tactile gripper designed for stable, robust, and efficient physical interaction. Unlike existing visuo-tactile sensors, LVTG enables more effective and stable grasping of larger and heavier everyday objects, thanks to its enlarged tactile sensing area and wider opening angle. Its surface skin is made of a highly wear-resistant material, significantly improving durability and extending operational lifespan. The integration of vision and tactile feedback allows LVTG to provide rich, high-fidelity sensory data, facilitating reliable perception during complex manipulation tasks. Furthermore, LVTG features a modular design that supports rapid maintenance and replacement. To effectively fuse vision and touch, we adopt a CLIP-inspired contrastive learning objective that aligns tactile embeddings with their corresponding visual observations, yielding a shared cross-modal representation space for visuo-tactile perception. This alignment improves the performance of an Action Chunking Transformer (ACT) policy in contact-rich manipulation, leading to more efficient data collection and more effective policy learning. Compared to the original ACT method, the proposed LVTG with pretraining achieves significantly higher success rates in manipulation tasks.
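The CLIP-inspired alignment mentioned above is, in general form, a symmetric contrastive (InfoNCE) objective over paired embeddings: matched tactile/visual pairs are pulled together while mismatched pairs in the same batch are pushed apart. Below is a minimal NumPy sketch of that general technique, not the authors' implementation; the function name, temperature value, and embedding dimensions are illustrative assumptions.

```python
import numpy as np

def clip_style_contrastive_loss(tactile_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings (CLIP-style sketch).

    tactile_emb, visual_emb: (N, D) arrays where row i of each forms a
    matched tactile/visual pair. Temperature 0.07 follows the CLIP paper's
    initial value; the paper's actual setting is not specified here.
    """
    # L2-normalize so the dot product is cosine similarity
    t = tactile_emb / np.linalg.norm(tactile_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature      # (N, N) pairwise similarity matrix
    labels = np.arange(logits.shape[0])   # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        # Numerically stable row-wise log-softmax, then pick the true class.
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric: tactile -> visual retrieval and visual -> tactile retrieval.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Because the loss is low only when each tactile embedding is closest to its own visual counterpart, minimizing it drives the two encoders toward the shared cross-modal representation space described in the abstract.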