Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots

📅 2025-10-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing vision-language-action (VLA) models are restricted to rigid serial manipulators and fail to generalize to soft continuum robots because of fundamental embodiment discrepancies in kinematics, action space, and control dynamics, which hinders safe, flexible interaction in human-shared environments. Method: a structured fine-tuning framework tailored to soft robotics addresses these mismatches through kinematic-aware adaptation, action-space alignment, and dynamics-aware control integration. Building on OpenVLA-OFT and π₀, two state-of-the-art VLA models, the authors develop an end-to-end vision-language-action fine-tuning and deployment pipeline for soft continuum robots. Contribution/Results: unadapted VLA policies fail entirely on soft hardware, whereas the fine-tuned models achieve performance parity with rigid-arm baselines across canonical manipulation tasks. This demonstrates, for the first time, successful transfer of embodied perception-action capabilities from rigid to soft robotic platforms, establishing a foundational step toward deployable VLA systems in compliant, human-centric environments.
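The summary names action-space alignment as one of the three mismatches. As a minimal illustration (not the authors' released code; the class name, tendon geometry, and single-segment constant-curvature assumption are all hypothetical), the sketch below retargets the Cartesian end-effector deltas a rigid-arm VLA typically emits into tendon-length commands for a constant-curvature soft segment:

```python
"""Hypothetical action-space adapter: Cartesian tip deltas (the rigid-arm VLA
action) -> tendon-length setpoints for one constant-curvature soft segment."""
import numpy as np

class SoftArmAdapter:
    def __init__(self, backbone_len=0.30, tendon_radius=0.02, n_tendons=3):
        self.L = backbone_len           # segment length [m], assumed constant
        self.r = tendon_radius          # tendon routing radius [m]
        self.betas = np.arange(n_tendons) * 2 * np.pi / n_tendons
        self.q = np.array([1e-3, 0.0])  # arc state: [bend angle theta, plane phi]

    def tip_position(self, q):
        """Constant-curvature forward kinematics for a single segment."""
        theta, phi = q
        if abs(theta) < 1e-6:           # straight-configuration limit
            return np.array([0.0, 0.0, self.L])
        rad = self.L / theta            # arc radius
        return np.array([rad * (1 - np.cos(theta)) * np.cos(phi),
                         rad * (1 - np.cos(theta)) * np.sin(phi),
                         rad * np.sin(theta)])

    def step(self, dxyz):
        """Map one Cartesian tip delta (the VLA action) to tendon lengths."""
        # Finite-difference Jacobian of tip position w.r.t. (theta, phi).
        eps, J = 1e-5, np.zeros((3, 2))
        for i in range(2):
            dq = np.zeros(2); dq[i] = eps
            J[:, i] = (self.tip_position(self.q + dq) -
                       self.tip_position(self.q - dq)) / (2 * eps)
        # Damped least squares keeps the update stable near singularities.
        lam = 1e-4
        dq = np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ dxyz)
        self.q = self.q + dq
        # Tendon model: l_i = L - r * theta * cos(phi - beta_i).
        theta, phi = self.q
        return self.L - self.r * theta * np.cos(phi - self.betas)

adapter = SoftArmAdapter()
tendon_cmd = adapter.step(np.array([0.01, 0.0, -0.005]))  # +1 cm x, -5 mm z
print(tendon_cmd)  # three tendon-length setpoints [m]
```

Damped least squares is used here because constant-curvature arms are near-singular in the straight configuration; a real pipeline would extend this to multiple segments and enforce actuation limits.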

📝 Abstract
Robotic systems are increasingly expected to operate in human-centered, unstructured environments where safety, adaptability, and generalization are essential. Vision-Language-Action (VLA) models have been proposed as a language-guided, generalized control framework for real robots. However, their deployment has been limited to conventional serial-link manipulators. Combined with the rigidity of those platforms and the unpredictability of learning-based control, the ability to interact safely with the environment is missing yet critical. In this work, we present the deployment of a VLA model on a soft continuum manipulator to demonstrate autonomous, safe human-robot interaction. We present a structured fine-tuning and deployment pipeline evaluating two state-of-the-art VLA models (OpenVLA-OFT and π₀) across representative manipulation tasks, and show that while out-of-the-box policies fail due to embodiment mismatch, with targeted fine-tuning the soft robot performs on par with its rigid counterpart. Our findings highlight the necessity of fine-tuning for bridging embodiment gaps, and demonstrate that coupling VLA models with soft robots enables safe and flexible embodied AI in human-shared environments.
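The fine-tuning pipeline itself is not detailed on this page. As a rough behavior-cloning sketch under stated assumptions, the skeleton below trains a placeholder policy on (image, language, tendon-action) triples; ContinuumDemos and TinyPolicy are stand-ins for the paper's demonstration data and the OpenVLA-OFT / π₀ checkpoints, not their actual interfaces:

```python
"""Illustrative behavior-cloning fine-tune loop on soft-robot demonstrations.
Dataset and policy are synthetic stand-ins; only the training skeleton is real."""
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class ContinuumDemos(Dataset):
    """Placeholder: (camera frame, language feature, tendon-space action) triples."""
    def __init__(self, n=256):
        self.obs = torch.randn(n, 3, 224, 224)   # wrist-camera frames
        self.lang = torch.randn(n, 512)          # precomputed text features
        self.act = torch.randn(n, 3)             # tendon-length targets
    def __len__(self): return len(self.act)
    def __getitem__(self, i): return self.obs[i], self.lang[i], self.act[i]

class TinyPolicy(nn.Module):
    """Stand-in for the pretrained VLA; real fine-tuning would freeze most of
    the backbone and train only adapters plus the soft-robot action head."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 8, 5, stride=4), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(8 + 512, 128), nn.ReLU(),
                                  nn.Linear(128, 3))
    def forward(self, obs, lang):
        return self.head(torch.cat([self.vision(obs), lang], dim=-1))

policy = TinyPolicy()
loader = DataLoader(ContinuumDemos(), batch_size=32, shuffle=True)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for epoch in range(3):
    for obs, lang, act in loader:
        loss = nn.functional.mse_loss(policy(obs, lang), act)  # BC objective
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

In a real run, most backbone weights would plausibly stay frozen, with only lightweight adapters and a re-dimensioned action head trained to match the soft robot's action space.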
Problem

Research questions and friction points this paper is trying to address.

Deploying Vision-Language-Action models on soft robots
Bridging embodiment gaps between rigid and soft robots
Enabling safe human-robot interaction in unstructured environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deploying Vision-Language-Action models on soft robots
Using structured finetuning to bridge embodiment gaps
Coupling VLA models with soft continuum manipulators (see the deployment sketch after this list)
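As a sketch of the last bullet, deployment can be pictured as a two-rate loop: the VLA is queried at a few Hz while a faster inner loop tracks its latest tendon setpoints. The rates, gain, and hardware hooks (get_camera_frame, read_tendon_lengths, send_motor_currents) are assumptions, not the paper's interfaces:

```python
"""Hypothetical two-rate deployment loop coupling a slow VLA policy with a
fast proportional controller on tendon lengths."""
import time
import numpy as np

POLICY_HZ, CONTROL_HZ = 2, 100   # assumed rates: slow VLA, fast inner loop
KP = 40.0                        # proportional gain on tendon-length error

def control_loop(policy, get_camera_frame, read_tendon_lengths,
                 send_motor_currents, instruction):
    """Query the VLA at POLICY_HZ; track its setpoints at CONTROL_HZ."""
    setpoint = read_tendon_lengths()      # hold the current pose at startup
    next_policy_t = time.monotonic()
    while True:
        now = time.monotonic()
        if now >= next_policy_t:          # slow loop: VLA inference
            setpoint = policy(get_camera_frame(), instruction)
            next_policy_t = now + 1.0 / POLICY_HZ
        # Fast loop: proportional tracking of the latest tendon setpoints.
        err = np.asarray(setpoint) - np.asarray(read_tendon_lengths())
        send_motor_currents(KP * err)
        time.sleep(1.0 / CONTROL_HZ)
```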
👥 Authors

Haochen Su
EPFL, Lausanne, Switzerland

Cristian Meo
LatentWorlds AI, TU Delft, Delft, Netherlands

Francesco Stella
Embodied AI SA, EPFL, Lausanne, Switzerland

Andrea Peirone
Embodied AI SA, EPFL, Lausanne, Switzerland

Kai Junge
PhD Student, EPFL
Manipulation, Robot design, Soft robotics, Embodied intelligence

Josie Hughes
EPFL, Lausanne, Switzerland