When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

📅 2026-06-10

🤖 AI Summary

Vision-language-action (VLA) models exhibit significant performance degradation under non-English instructions, yet the mechanisms underlying their linguistic robustness remain unclear. This work translates the LIBERO benchmark into ten languages for what the authors describe as the first systematic multilingual evaluation of VLA models, and finds that success rates drop by 30-50% under non-English instructions. Fine-grained execution analysis shows that reliance on language is heterogeneous across steps: some steps are strongly language-dependent and dominate overall task failure, while others are largely language-agnostic. Building on this, the authors propose a step-wise inference-time intervention that aligns representations according to each step's language sensitivity, which substantially improves performance under linguistic variation. The work reframes linguistic robustness in VLA models as a temporally structured, step-wise control problem.

📝 Abstract

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.
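The abstract does not specify how step sensitivity is measured or how representations are aligned, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes (hypothetically) that each step has a representation vector for the English instruction and one for the translated instruction, that sensitivity is the cosine distance between the two, and that alignment is a linear pull toward the English vector. The names `step_sensitivity`, `align_steps`, `threshold`, and `alpha` are all illustrative.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def step_sensitivity(english: Sequence[Vector], other: Sequence[Vector]) -> List[float]:
    """Assumed proxy for language sensitivity: per-step drift between the
    English and non-English representations."""
    return [cosine_distance(e, o) for e, o in zip(english, other)]


def align_steps(
    english: Sequence[Vector],
    other: Sequence[Vector],
    threshold: float = 0.1,  # hypothetical cutoff for "language-sensitive"
    alpha: float = 1.0,      # hypothetical alignment strength in [0, 1]
) -> List[List[float]]:
    """Pull only the language-sensitive steps toward the English reference;
    steps at or below the threshold are returned unchanged."""
    aligned: List[List[float]] = []
    for e, o, s in zip(english, other, step_sensitivity(english, other)):
        if s > threshold:
            aligned.append([(1.0 - alpha) * oi + alpha * ei for ei, oi in zip(e, o)])
        else:
            aligned.append(list(o))
    return aligned
```

Gating on a threshold reflects the finding that only some steps are language-dependent: language-agnostic steps pass through untouched, so the intervention acts only where the two representations diverge.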

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

multilingual robustness

language sensitivity

robotic manipulation

instruction understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual evaluation

step-wise language sensitivity

vision-language-action models