Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses real-time detection of localized failures in vision-language-action (VLA) models during task execution. The authors propose a coarsely supervised learning approach that requires only trajectory-level labels yet uncovers fine-grained failure signals without step-level annotations, a first in the field. By integrating inter- and intra-trajectory contrastive learning with conformal prediction, the method identifies failure-indicative actions and generates temporally structured failure signals. Experiments show state-of-the-art multi-task failure detection on the LIBERO and VLABench benchmarks, as well as on a real robotic platform. The approach achieves a practical trade-off between detection accuracy and timeliness, and generalizes to both seen and unseen tasks.

📝 Abstract

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods rely on either expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $π_0$, and $π_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.
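
The abstract mentions an accuracy-timeliness trade-off under conformal prediction but does not spell out the calibration procedure. The sketch below is one common way such a runtime threshold could be set, not the authors' implementation: it assumes a monitor that emits a scalar failure score per timestep, calibrates on successful rollouts only, and uses the peak score of each rollout as its nonconformity value. The function names `calibrate_threshold` and `first_alarm` and the parameter `alpha` are hypothetical.

```python
import math
import numpy as np

def calibrate_threshold(success_scores, alpha=0.1):
    """Split-conformal threshold from successful calibration rollouts.

    success_scores: list of 1-D arrays, one per rollout, holding the
    per-step failure scores (assumed to come from some monitor).
    alpha: target false-alarm rate on successful rollouts.
    """
    # Nonconformity value per rollout: its highest per-step score.
    peaks = np.sort([float(np.max(s)) for s in success_scores])
    n = len(peaks)
    # Finite-sample conformal rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration rollouts for this alpha: never raise an alarm.
        return float("inf")
    return float(peaks[k - 1])

def first_alarm(scores, threshold):
    """Return the first timestep whose score exceeds the threshold, else None."""
    for t, s in enumerate(scores):
        if s > threshold:
            return t
    return None
```

A smaller `alpha` raises the threshold, which reduces false alarms on successful rollouts but tends to delay detection; this is the accuracy-timeliness tension the abstract refers to.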

Problem

Research questions and friction points this paper is trying to address.

failure detection

Vision-Language-Action

runtime monitoring

trajectory-level supervision

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

failure detection

Vision-Language-Action (VLA)

coarse-to-fine supervision