VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation

šŸ“… 2025-05-14
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses the challenge of language-conditioned, contact-intensive robotic manipulation (e.g., fingertip insertion), focusing on robust fusion of vision–tactile multimodal signals for language-guided policy learning. We propose the first unified vision–tactile–language–action modeling framework, enabling deep cross-modal perception integration via language alignment. Crucially, we introduce a novel Direct Preference Optimization (DPO) paradigm tailored for continuous control, replacing conventional token-level classification losses. To support training, we construct a low-cost, simulation-generated multimodal instruction dataset (vision–tactile–action–instruction). Experiments demonstrate a success rate of over 90% on unseen peg-insertion tasks, substantially outperforming diffusion-based policies and TLA/VLA baselines. Moreover, our method exhibits strong sim-to-real transfer capability.
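To make the preference-learning component concrete, below is a minimal PyTorch sketch of a standard DPO objective applied to whole action sequences. The paper's exact adaptation to continuous control (for example, how preferred and rejected actions are ranked) is not spelled out in this summary, so the function name, its arguments, and the `beta` value are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """DPO-style loss over a preferred and a rejected action sequence.

    Each argument is the summed log-probability that the trainable policy
    (or the frozen reference model) assigns to an action sequence given the
    same vision-tactile-language context; `beta` scales the implicit reward.
    """
    # Implicit rewards: log-probability ratios against the reference model.
    preferred_reward = beta * (logp_pref - ref_logp_pref)
    rejected_reward = beta * (logp_rej - ref_logp_rej)
    # Maximize the margin between preferred and rejected actions.
    return -F.logsigmoid(preferred_reward - rejected_reward).mean()

# Toy usage with random log-probabilities for a batch of four preference pairs.
if __name__ == "__main__":
    logp_pref, logp_rej = torch.randn(4), torch.randn(4)
    ref_pref, ref_rej = torch.randn(4), torch.randn(4)
    print(dpo_preference_loss(logp_pref, logp_rej, ref_pref, ref_rej))
```

One plausible way to obtain the preference pairs, consistent with the "regression-like supervision" described above, is to rank sampled action sequences by their distance to the ground-truth insertion action and label the closer one as preferred; this is an assumption for illustration, not a detail confirmed by the summary.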

šŸ“ Abstract
While vision-language models have advanced significantly, their application in language-conditioned robotic manipulation is still underexplored, especially for contact-rich tasks that extend beyond visually dominant pick-and-place scenarios. To bridge this gap, we introduce the Vision-Tactile-Language-Action (VTLA) model, a novel framework that enables robust policy generation in contact-intensive scenarios by effectively integrating visual and tactile inputs through cross-modal language grounding. A low-cost, multi-modal dataset has been constructed in a simulation environment, containing vision-tactile-action-instruction pairs specifically designed for the fingertip insertion task. Furthermore, we introduce Direct Preference Optimization (DPO) to offer regression-like supervision for the VTLA model, effectively bridging the gap between classification-based next token prediction loss and continuous robotic tasks. Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multi-modal baselines (TLA/VLA), achieving over 90% success rates on unseen peg shapes. Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model. For supplementary videos and results, please visit our project website: https://sites.google.com/view/vtla
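The abstract describes a simulated dataset of vision-tactile-action-instruction pairs for the fingertip insertion task. As a rough illustration of what one training sample could contain, here is a minimal Python sketch; the field names, array shapes, and the 6-DoF action format are assumptions made for clarity and are not taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VTLASample:
    """One hypothetical vision-tactile-action-instruction training sample."""
    rgb_image: np.ndarray      # e.g. (H, W, 3) wrist- or scene-camera frame
    tactile_image: np.ndarray  # e.g. (H, W, 3) fingertip tactile sensor reading
    instruction: str           # language command conditioning the policy
    action: np.ndarray         # e.g. (6,) end-effector pose adjustment

# Toy sample; shapes and values are placeholders, not real data.
sample = VTLASample(
    rgb_image=np.zeros((224, 224, 3), dtype=np.uint8),
    tactile_image=np.zeros((224, 224, 3), dtype=np.uint8),
    instruction="Insert the hexagonal peg into the hole.",
    action=np.array([0.001, -0.002, -0.005, 0.0, 0.0, 0.01]),
)
print(sample.instruction, sample.action.shape)
```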
Problem

Research questions and friction points this paper is trying to address.

Extending vision-language models to contact-rich robotic manipulation tasks
Integrating visual-tactile inputs via cross-modal language grounding
Enhancing Sim2Real performance in insertion tasks with preference learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates visual and tactile inputs via language grounding (see the architecture sketch after this list)
Uses Direct Preference Optimization for continuous tasks
Achieves high success on unseen shapes with VTLA
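The fusion mechanism is only named in the bullets above, so the following PyTorch sketch shows one plausible prefix-token design: projected vision and tactile features are concatenated with instruction-token embeddings and passed through a transformer backbone that regresses a continuous action. The class name `MultimodalPrefixFusion`, the single-token projections, the layer sizes, and the 6-DoF action head are illustrative assumptions, not the actual VTLA architecture.

```python
import torch
import torch.nn as nn

class MultimodalPrefixFusion(nn.Module):
    """Illustrative fusion: vision and tactile features become prefix tokens,
    concatenated with language-token embeddings before a transformer backbone."""

    def __init__(self, d_model=512, vision_dim=768, tactile_dim=768, vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, d_model)    # vision feature -> one token
        self.tactile_proj = nn.Linear(tactile_dim, d_model)  # tactile feature -> one token
        self.embed = nn.Embedding(vocab, d_model)             # language token embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, 6)              # continuous 6-DoF action

    def forward(self, vision_feat, tactile_feat, instr_ids):
        v = self.vision_proj(vision_feat).unsqueeze(1)   # (B, 1, d)
        t = self.tactile_proj(tactile_feat).unsqueeze(1) # (B, 1, d)
        l = self.embed(instr_ids)                        # (B, L, d)
        h = self.backbone(torch.cat([v, t, l], dim=1))   # fused token sequence
        return self.action_head(h[:, -1])                # action from the last token

# Toy forward pass with random features and a 16-token instruction.
model = MultimodalPrefixFusion()
out = model(torch.randn(2, 768), torch.randn(2, 768), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 6])
```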
Authors
Chaofan Zhang
Institute of Automation, Chinese Academy of Sciences
tactile perception and robot dexterous manipulation
Peng Hao
Samsung R&D Institute China–Beijing
Xiaoge Cao
Institute of Automation, Chinese Academy of Sciences
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
vision and language
Shaowei Cui
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Shuo Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences