VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation

šŸ“… 2025-05-14
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This work addresses the challenge of language-conditioned, contact-intensive robotic manipulation (e.g., fingertip insertion), focusing on robust fusion of vision–tactile multimodal signals for language-guided policy learning. We propose the first unified vision–tactile–language–action modeling framework, enabling deep cross-modal perception integration via language alignment. Crucially, we introduce a novel Direct Preference Optimization (DPO) paradigm tailored for continuous control, replacing conventional token-level classification losses. To support training, we construct a low-cost, simulation-generated multimodal instruction dataset (vision–tactile–action–instruction). Experiments demonstrate a success rate of over 90% on unseen peg-insertion tasks, substantially outperforming diffusion-based policies and TLA/VLA baselines. Moreover, our method exhibits strong sim-to-real transfer capability.
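To make the preference-learning component concrete, below is a minimal PyTorch sketch of a standard DPO objective applied to whole action sequences. The paper's exact adaptation to continuous control (for example, how preferred and rejected actions are ranked) is not spelled out in this summary, so the function name, its arguments, and the `beta` value are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """DPO-style loss over a preferred and a rejected action sequence.

    Each argument is the summed log-probability that the trainable policy
    (or the frozen reference model) assigns to an action sequence given the
    same vision-tactile-language context; `beta` scales the implicit reward.
    """
    # Implicit rewards: log-probability ratios against the reference model.
    preferred_reward = beta * (logp_pref - ref_logp_pref)
    rejected_reward = beta * (logp_rej - ref_logp_rej)
    # Maximize the margin between preferred and rejected actions.
    return -F.logsigmoid(preferred_reward - rejected_reward).mean()

# Toy usage with random log-probabilities for a batch of four preference pairs.
if __name__ == "__main__":
    logp_pref, logp_rej = torch.randn(4), torch.randn(4)
    ref_pref, ref_rej = torch.randn(4), torch.randn(4)
    print(dpo_preference_loss(logp_pref, logp_rej, ref_pref, ref_rej))
```

One plausible way to obtain the preference pairs, consistent with the "regression-like supervision" described above, is to rank sampled action sequences by their distance to the ground-truth insertion action and label the closer one as preferred; this is an assumption for illustration, not a detail confirmed by the summary.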

šŸ“ Abstract
While vision-language models have advanced significantly, their application in language-conditioned robotic manipulation is still underexplored, especially for contact-rich tasks that extend beyond visually dominant pick-and-place scenarios. To bridge this gap, we introduce the Vision-Tactile-Language-Action (VTLA) model, a novel framework that enables robust policy generation in contact-intensive scenarios by effectively integrating visual and tactile inputs through cross-modal language grounding. A low-cost, multi-modal dataset has been constructed in a simulation environment, containing vision-tactile-action-instruction pairs specifically designed for the fingertip insertion task. Furthermore, we introduce Direct Preference Optimization (DPO) to offer regression-like supervision for the VTLA model, effectively bridging the gap between classification-based next token prediction loss and continuous robotic tasks. Experimental results show that the VTLA model outperforms traditional imitation learning methods (e.g., diffusion policies) and existing multi-modal baselines (TLA/VLA), achieving over 90% success rates on unseen peg shapes. Finally, we conduct real-world peg-in-hole experiments to demonstrate the exceptional Sim2Real performance of the proposed VTLA model. For supplementary videos and results, please visit our project website: https://sites.google.com/view/vtla
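The abstract describes a simulated dataset of vision-tactile-action-instruction pairs for the fingertip insertion task. As a rough illustration of what one training sample could contain, here is a minimal Python sketch; the field names, array shapes, and the 6-DoF action format are assumptions made for clarity and are not taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VTLASample:
    """One hypothetical vision-tactile-action-instruction training sample."""
    rgb_image: np.ndarray      # e.g. (H, W, 3) wrist- or scene-camera frame
    tactile_image: np.ndarray  # e.g. (H, W, 3) fingertip tactile sensor reading
    instruction: str           # language command conditioning the policy
    action: np.ndarray         # e.g. (6,) end-effector pose adjustment

# Toy sample; shapes and values are placeholders, not real data.
sample = VTLASample(
    rgb_image=np.zeros((224, 224, 3), dtype=np.uint8),
    tactile_image=np.zeros((224, 224, 3), dtype=np.uint8),
    instruction="Insert the hexagonal peg into the hole.",
    action=np.array([0.001, -0.002, -0.005, 0.0, 0.0, 0.01]),
)
print(sample.instruction, sample.action.shape)
```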
Problem

Research questions and friction points this paper is trying to address.

Extending vision-language models to contact-rich robotic manipulation tasks
Integrating visual-tactile inputs via cross-modal language grounding
Enhancing Sim2Real performance in insertion tasks with preference learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates visual and tactile inputs via language grounding (see the architecture sketch after this list)
Uses Direct Preference Optimization for continuous tasks
Achieves high success on unseen shapes with VTLA
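The fusion mechanism is only named in the bullets above, so the following PyTorch sketch shows one plausible prefix-token design: projected vision and tactile features are concatenated with instruction-token embeddings and passed through a transformer backbone that regresses a continuous action. The class name `MultimodalPrefixFusion`, the single-token projections, the layer sizes, and the 6-DoF action head are illustrative assumptions, not the actual VTLA architecture.

```python
import torch
import torch.nn as nn

class MultimodalPrefixFusion(nn.Module):
    """Illustrative fusion: vision and tactile features become prefix tokens,
    concatenated with language-token embeddings before a transformer backbone."""

    def __init__(self, d_model=512, vision_dim=768, tactile_dim=768, vocab=32000):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, d_model)    # vision feature -> one token
        self.tactile_proj = nn.Linear(tactile_dim, d_model)  # tactile feature -> one token
        self.embed = nn.Embedding(vocab, d_model)             # language token embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, 6)              # continuous 6-DoF action

    def forward(self, vision_feat, tactile_feat, instr_ids):
        v = self.vision_proj(vision_feat).unsqueeze(1)   # (B, 1, d)
        t = self.tactile_proj(tactile_feat).unsqueeze(1) # (B, 1, d)
        l = self.embed(instr_ids)                        # (B, L, d)
        h = self.backbone(torch.cat([v, t, l], dim=1))   # fused token sequence
        return self.action_head(h[:, -1])                # action from the last token

# Toy forward pass with random features and a 16-token instruction.
model = MultimodalPrefixFusion()
out = model(torch.randn(2, 768), torch.randn(2, 768), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 6])
```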
Authors
Chaofan Zhang
Institute of Automation, Chinese Academy of Sciences
tactile perception and robot dexterous manipulation
Peng Hao
Samsung R&D Institute China–Beijing
Xiaoge Cao
Institute of Automation, Chinese Academy of Sciences
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
vision and language
Shaowei Cui
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Shuo Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences