Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation

πŸ“… 2025-11-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Vision-language-action (VLA) models are fundamentally limited by purely visual perception, struggling to accurately model contact dynamics and temporal interaction structure during manipulation. To address this, we propose Audio-VLAβ€”the first VLA framework integrating contact audio to enhance understanding of dynamic manipulation processes via tactile-relevant acoustic signals. Methodologically, we fuse DINOv2/SigLIP visual encoders, AudioCLIP audio encoders, and the Llama2 language model, employing LoRA fine-tuning and cross-modal projection for effective multimodal alignment. We further introduce Task Completion Rate (TCR), the first metric explicitly designed to evaluate dynamic interaction perception capability. Extensive experiments on LIBERO, RLBench, and two real-robot manipulation tasks demonstrate that Audio-VLA significantly outperforms vision-only baselines, empirically validating the critical, complementary role of audio in grounding action understanding.

πŸ“ Abstract
Vision-Language-Action (VLA) models have recently achieved significant advances in robotic manipulation. However, vision-only VLA models face fundamental limitations, particularly in perceiving interaction events and dynamic manipulation processes. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback, overcoming the vision-only constraints of VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs, and a multimodal projection layer aligns features from the different modalities into a shared feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced with collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than systematically assessing dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, yielding a more comprehensive evaluation. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA's superior performance over vision-only baselines, while the TCR metric effectively quantifies dynamic process perception capabilities.
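The abstract mentions a multimodal projection layer that maps visual (DINOv2/SigLIP) and audio (AudioCLIP) features into the same feature space before they reach the Llama2 backbone. The paper's exact architecture is not reproduced here, so the following is a minimal sketch assuming simple per-modality linear projections; all dimensions, weight initializations, and names are illustrative, not taken from the paper.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    # Hypothetical learned weight matrix (out_dim x in_dim);
    # random values stand in for trained parameters.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weights):
    # Map a feature vector into the shared embedding space: y = W x.
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

# Illustrative dimensions (not from the paper):
# visual 1024-d and audio 512-d, both projected to a shared 256-d space.
vis_proj = make_projection(1024, 256)
aud_proj = make_projection(512, 256)

visual_feat = [0.5] * 1024   # placeholder DINOv2/SigLIP feature
audio_feat = [0.5] * 512     # placeholder AudioCLIP feature

shared_visual = project(visual_feat, vis_proj)
shared_audio = project(audio_feat, aud_proj)

# Once aligned, both modalities can be passed to the language model
# as tokens in the same embedding space.
tokens = [shared_visual, shared_audio]
```

In practice such projections are trained jointly with LoRA adapters on the frozen encoders; the sketch only shows the alignment step, not the fine-tuning.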
Problem

Research questions and friction points this paper is trying to address.

Overcoming vision-only limitations in robotic manipulation perception models
Perceiving contact events and dynamic processes using audio feedback
Addressing lack of systematic evaluation for dynamic operational processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates contact audio perception into VLA models
Uses pre-trained encoders with LoRA fine-tuning
Introduces Task Completion Rate for dynamic evaluation
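The paper positions TCR as a process-level metric rather than a final-outcome success rate, but does not state its formula here. The sketch below assumes TCR is the fraction of a task's intermediate subgoals that the policy completes; the function name and the subgoal decomposition are hypothetical.

```python
def task_completion_rate(completed_subgoals, total_subgoals):
    """Hypothetical TCR: fraction of manipulation subgoals completed.

    Unlike a binary success rate, this credits partial progress
    through a task's dynamic stages (an assumption, not the
    paper's published definition).
    """
    if total_subgoals == 0:
        raise ValueError("task must define at least one subgoal")
    return completed_subgoals / total_subgoals

# Example: a pouring task decomposed into 4 stages
# (reach, grasp, lift, pour); the policy fails only the pour.
print(task_completion_rate(3, 4))  # 0.75
```

A binary success metric would score this rollout 0, whereas a process-level metric like TCR distinguishes it from a rollout that fails at the first stage.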
Xiangyi Wei
School of Computer Science and Technology, East China Normal University
Haotian Zhang
School of Data Science and Engineering, East China Normal University
Xinyi Cao
School of Software Engineering, East China Normal University
Siyu Xie
School of Software Engineering, East China Normal University
Weifeng Ge
Fudan University
Yang Li
School of Computer Science and Technology, East China Normal University
Changbo Wang
School of Data Science and Engineering, East China Normal University

Humanoid Robot · Computer Vision · Artificial Intelligence · AI4Science