SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models

📅 2026-05-30
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Evaluations of vision-language-action (VLA) models typically report only task success rates, which overlooks safety hazards during execution. This work proposes SafeVLA-Bench, a post-hoc safety evaluation framework for VLA policies on existing simulator-based benchmarks. It formalizes task-specific safety constraints in Signal Temporal Logic (STL) and introduces two metrics to quantify safety risk: Succ-But-Unsafe (SBU), the fraction of rollouts that succeed while violating safety, and the Violation Severity Index (VSI), a bounded score of worst-case violation depth. Experiments on LIBERO and RoboCasa-365 show that high-success-rate tabletop baselines still have unsafe-episode rates of 13% to 15%, and that 36% to 56% of successful RoboCasa-365 rollouts violate at least one active safety clause.
📝 Abstract
Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing simulator-based VLA benchmarks. It formalizes task-aware safety requirements as Signal Temporal Logic (STL) specifications and reports native success alongside two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded worst-violation depth score. We instantiate SafeVLA-Bench on LIBERO and RoboCasa-365, evaluating nine policy-benchmark entries across tabletop and kitchen manipulation tasks. High task success does not imply safe execution: tabletop baselines with high success rates (SR) still show unsafe-episode rates of 13 to 15 percent, and 36 to 56 percent of successful RoboCasa-365 rollouts violate at least one active safety clause. Project page: https://safevla.org.
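To make the two metrics concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a single hypothetical safety clause, "contact force always stays at or below a limit", scored with the standard STL robustness of an always-operator (the minimum margin over time). SBU follows the definition in the abstract. The abstract does not give the VSI formula, so `bounded_severity` is only an illustrative stand-in that clips the worst violation depth to [0, 1]. The names `Rollout`, `contact_force`, `limit`, and `scale` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    success: bool                    # native benchmark success flag
    contact_force: Sequence[float]   # hypothetical per-step contact force signal


def always_below(signal: Sequence[float], limit: float) -> float:
    """Robustness of the STL clause G(signal <= limit).

    Returns the minimum over time of (limit - x). A negative value means
    the clause is violated, and its magnitude is the worst violation depth.
    """
    return min(limit - x for x in signal)


def sbu(rollouts: Sequence[Rollout], limit: float) -> float:
    """Fraction of all rollouts that both succeed and violate the clause."""
    if not rollouts:
        return 0.0
    unsafe_successes = sum(
        1 for r in rollouts
        if r.success and always_below(r.contact_force, limit) < 0
    )
    return unsafe_successes / len(rollouts)


def bounded_severity(robustness: float, scale: float) -> float:
    """Illustrative bounded severity (an assumption, not the paper's VSI).

    Maps worst violation depth to [0, 1]: 0 when the clause is satisfied,
    1 when the violation depth reaches or exceeds `scale`.
    """
    return min(1.0, max(0.0, -robustness / scale))


rollouts = [
    Rollout(success=True, contact_force=[1.0, 2.0, 3.0]),   # succeeds, safe
    Rollout(success=True, contact_force=[1.0, 7.0, 3.0]),   # succeeds, unsafe
    Rollout(success=False, contact_force=[9.0]),            # fails, unsafe
    Rollout(success=True, contact_force=[4.0, 6.0]),        # succeeds, unsafe
]
print(sbu(rollouts, limit=5.0))
print(bounded_severity(always_below([1.0, 7.0, 3.0], 5.0), scale=4.0))
```

In this toy data the failed rollout also violates the clause, but it does not count toward SBU because SBU only covers rollouts that succeed. A real evaluation would combine several task-aware clauses per task, which this sketch does not attempt.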
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action Models
Safety Evaluation
Benchmarking
Task Success
Unsafe Behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action Models
Safety Evaluation
Signal Temporal Logic
Benchmarking
Robot Manipulation