🤖 AI Summary
Existing vision-language-action (VLA) models are not robust to lighting variations, viewpoint shifts, or small initial-state perturbations, which often cause grasping and placement failures. This work proposes a training-free runtime intervention framework that corrects execution failures without modifying model weights. It integrates a lightweight multi-object hidden-state probe, an object-agnostic kinematic state machine, and a hierarchical control barrier function (CBF) filter to detect failures and recover from them in real time. It is presented as the first plug-and-play failure-recovery mechanism for pretrained VLA policies that requires no training. Combined with Hungarian-matching-based identity tracking and safety-aware action filtering, the approach raises the success rate of OpenVLA-OFT from 69.6% to 74.1% on the LIBERO-plus benchmark, a gain of 4.5 percentage points.
📝 Abstract
Vision-Language-Action (VLA) models demonstrate strong performance on language-conditioned robotic manipulation within their training distribution, yet their generalization capabilities remain fundamentally limited. They lack the robustness required to handle perturbations, frequently failing when confronted with lighting changes, altered camera viewpoints, or small initial-state variations. We propose PROBEACT, a training-free runtime intervention framework that detects and recovers from grasping and placement failures in pretrained VLA policies without modifying their weights or requiring additional demonstrations. PROBEACT combines three components: (i) a lightweight multi-target hidden-state probe that predicts the 3D positions of task-relevant objects from intermediate VLA features, with Hungarian-matched identity tracking for multi-object scenes; (ii) an object-agnostic kinematic state machine that detects grasp, transport, and placement failures using only gripper-internal signals and end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF) filter that encodes repeated-failure locations as soft safe-set constraints, minimally correcting VLA actions while preserving baseline behavior. As a plug-and-play, training-free intervention loop, PROBEACT is orthogonal to existing training pipelines. Evaluated on the LIBERO-plus benchmark, our framework acts as a universal safety net, improving the success rate of the OpenVLA-OFT model from 69.6% to 74.1%, while demonstrating broad applicability to both base and fine-tuned VLA policies.
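To illustrate the idea behind component (iii), the following is a minimal sketch of a single-constraint CBF action filter. It is not the PROBEACT implementation. It assumes, purely for illustration, that the action is an end-effector velocity (single-integrator dynamics), that one previously observed failure location is modeled as a sphere to keep out of, and that the barrier is h(x) = ||x - c||^2 - r^2. The function name `cbf_filter` and the parameters `alpha` and `slack_weight` are hypothetical. With one linear constraint, the minimal-correction quadratic program has a closed-form solution, so no solver is needed.

```python
from typing import List, Optional, Sequence


def cbf_filter(
    position: Sequence[float],
    nominal_action: Sequence[float],
    failure_center: Sequence[float],
    radius: float,
    alpha: float = 1.0,
    slack_weight: Optional[float] = None,
) -> List[float]:
    """Minimally correct a nominal velocity action so it satisfies one CBF constraint.

    Assumed (illustrative) setup:
      dynamics   x_dot = u
      barrier    h(x) = ||x - c||^2 - r^2   (h >= 0 means outside the failure region)
      condition  grad_h(x) . u >= -alpha * h(x)

    If slack_weight is None the constraint is hard. Otherwise a slack variable s
    with penalty slack_weight * s^2 softens it, so the correction is smaller.
    """
    # Gradient of h and the right-hand side of the CBF condition a . u >= b.
    a = [2.0 * (p - c) for p, c in zip(position, failure_center)]
    h = sum((p - c) ** 2 for p, c in zip(position, failure_center)) - radius ** 2
    b = -alpha * h

    # Positive violation means the nominal action breaks the condition.
    violation = b - sum(ai * ui for ai, ui in zip(a, nominal_action))
    if violation <= 0.0:
        return list(nominal_action)  # already safe: leave the policy's action untouched

    denom = sum(ai * ai for ai in a)
    if slack_weight is not None:
        denom += 1.0 / slack_weight  # soft constraint: part of the violation goes to slack
    if denom == 0.0:
        return list(nominal_action)  # degenerate gradient: no correction direction exists

    # Closed-form solution of: min ||u - u_nom||^2 (+ w * s^2)  s.t.  a . u (+ s) >= b
    mu = violation / denom
    return [ui + mu * ai for ui, ai in zip(nominal_action, a)]


if __name__ == "__main__":
    # End effector 0.2 m from a failure location, commanded straight toward it.
    print(cbf_filter([0.2, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], radius=0.1))
```

An action that already satisfies the condition is returned unchanged, which mirrors the abstract's goal of preserving baseline behavior. Handling several failure locations or a hierarchy of constraints would generally require a QP solver rather than this closed form.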