FATE-VLA: Failure-aware test generation for vision-language-action models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Static benchmarks struggle to uncover the sparse, clustered failures of vision-language-action (VLA) models in high-dimensional embodied spaces, so they can underestimate robustness risks. This work introduces active test generation into VLA assessment for the first time, proposing a failure-oriented dynamic testing framework that adaptively generates high-risk, diverse test cases by combining diversity-guided scene exploration with a surrogate model trained on execution feedback. Experiments on four state-of-the-art VLA models show that the approach uncovers up to 29.7% more failures than selected baselines and reveals more diverse failure modes; for GR00T-N1.6, for instance, the success rate drops from 64.4% to 34.7%. The results argue for moving VLA evaluation from fixed task suites toward active, dynamic test generation.
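
The GR00T-N1.6 figures quoted above can be read in two ways, and a short calculation makes the difference explicit. The sketch below only restates the two success rates from the summary; the variable names are illustrative, and reading the headline 29.7% as a percentage-point difference is an inference from the numbers, not something the summary states.

```python
def failure_rate(success_rate_pct: float) -> float:
    """Failure rate in percent, given a success rate in percent."""
    return 100.0 - success_rate_pct

# Success rates for GR00T-N1.6 as quoted in the summary.
success_before = 64.4
success_after = 34.7

fail_before = failure_rate(success_before)  # 35.6% of runs fail
fail_after = failure_rate(success_after)    # 65.3% of runs fail

# Absolute change in failure rate, in percentage points.
absolute_gain_pp = fail_after - fail_before

# Relative change: how many more failures per failure seen before.
relative_gain_pct = 100.0 * absolute_gain_pp / fail_before

print(f"absolute: {absolute_gain_pp:.1f} pp, relative: {relative_gain_pct:.1f}%")
```

The absolute difference works out to 29.7 percentage points, which matches the headline figure, while the relative increase in failures for this model is about 83%.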

📝 Abstract

Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines diversity-driven exploration with surrogate models learned from observed executions. The method steers testing toward high-risk yet diverse scene regions. Across four state-of-the-art VLA models, it uncovers substantially more failures (up to +29.7% over selected baselines) while revealing more diverse failure modes. For instance, the success rate of GR00T-N1.6 dropped from 64.4% to 34.7%. More broadly, our findings call for a shift in VLA evaluation: from passive measurement on fixed task suites to adaptive, failure-seeking test generation that exposes the structure of model weaknesses before deployment.
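
To make the idea of combining a learned surrogate with diversity-driven exploration concrete, here is a minimal sketch of one way such a loop could look. It is not the paper's implementation. Everything in it is an assumption made for illustration: the 2-D scene space, the k-nearest-neighbour surrogate, the additive scoring rule, and the hypothetical `toy_policy_fails` function that stands in for executing a real VLA policy.

```python
import math
import random

def toy_policy_fails(scene):
    """Hypothetical stand-in for running a VLA policy in a scene.
    Failures are clustered in a small ball around (0.8, 0.2)."""
    return math.dist(scene, (0.8, 0.2)) < 0.15

def surrogate_failure_prob(scene, history, k=5):
    """k-nearest-neighbour estimate of failure probability from observed runs."""
    if not history:
        return 0.5  # uninformed prior before any execution
    nearest = sorted(history, key=lambda rec: math.dist(scene, rec[0]))[:k]
    return sum(failed for _, failed in nearest) / len(nearest)

def novelty(scene, history):
    """Distance to the closest already-tested scene (larger means more diverse)."""
    if not history:
        return 1.0
    return min(math.dist(scene, tested) for tested, _ in history)

def generate_tests(execute, budget=60, pool_size=50, diversity_weight=1.0, seed=0):
    """Pick one scene per step that scores high on predicted risk plus novelty."""
    rng = random.Random(seed)
    history = []  # list of (scene, failed) pairs from executed tests
    for _ in range(budget):
        # Candidate scenes sampled uniformly from the unit square.
        pool = [(rng.random(), rng.random()) for _ in range(pool_size)]
        best = max(
            pool,
            key=lambda s: surrogate_failure_prob(s, history)
            + diversity_weight * novelty(s, history),
        )
        # Execute the chosen test and feed the outcome back to the surrogate.
        history.append((best, execute(best)))
    return history

history = generate_tests(toy_policy_fails)
failures = sum(failed for _, failed in history)
print(f"{failures} failures in {len(history)} executions")
```

The `diversity_weight` parameter controls the trade-off named in the abstract: a larger value spreads tests across the scene space, while a smaller one concentrates them near failures already observed.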

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

evaluation

failure discovery

robustness

test generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

failure-aware testing

vision-language-action models

active evaluation