ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the brittleness of vision-language-action (VLA) policies in off-nominal states, where targeted recovery is required. The authors propose ReCoVLA, a failure-conditioned residual recovery framework: the pretrained VLA policy stays frozen, while an external vision-language model (VLM) infers the failure mode and recovery stage and predicts a recovery descriptor and reward mask. A structured reward is then compiled from the selected task-relevant components and used to train a residual policy in simulation. Because the VLM acts as a semantic reward selector rather than generating actions or rewards directly, the approach decouples high-level failure understanding from low-level corrective control and can support different VLAs. Across short-horizon, long-horizon, and contact-rich manipulation tasks, the reward compiler raises average simulation success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%, and zero-shot sim-to-real deployment reaches 61.7% average success on physical hardware, the best average among the tested methods.
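As a rough illustration of the residual idea, the sketch below adds a correction from a recovery policy to the action of a frozen base policy. The source does not specify how the two outputs are combined, so the additive form, the `scale` factor, the clipping `limit`, and the function name `compose_action` are all assumptions made for this example only.

```python
from typing import Sequence


def compose_action(
    base_action: Sequence[float],
    residual: Sequence[float],
    scale: float = 0.1,
    limit: float = 1.0,
) -> list[float]:
    """Add a scaled residual correction to the frozen policy's action.

    Assumptions (not from the paper): additive composition, a fixed
    residual scale, and symmetric clipping to [-limit, limit].
    """
    if len(base_action) != len(residual):
        raise ValueError("base action and residual must have the same length")
    return [
        max(-limit, min(limit, b + scale * r))
        for b, r in zip(base_action, residual)
    ]


# Hypothetical 3-DoF example: the frozen VLA proposes an action and the
# residual policy nudges it during recovery.
vla_action = [0.50, -0.20, 0.95]
residual_output = [1.0, -1.0, 1.0]
corrected = compose_action(vla_action, residual_output)
```

Keeping the residual small relative to the base action is one common way to preserve the pretrained policy's behavior in nominal states. Whether ReCoVLA does this is not stated in the summary.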
📝 Abstract
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA, a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
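The reward-selector idea can be sketched as a mask over a fixed library of reward components: the VLM only chooses which components are active, and the numeric reward comes from the components themselves. This is a minimal sketch under assumptions. The abstract does not give the descriptor fields, the component names (`reach`, `grasp`, `lift`), the weights, or the state keys, so all of them are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

State = Mapping[str, float]
RewardFn = Callable[[State], float]


@dataclass(frozen=True)
class RecoveryDescriptor:
    """Hypothetical stand-in for the VLM's prediction."""
    failure_mode: str
    stage: str
    reward_mask: Mapping[str, bool]


def compile_reward(
    descriptor: RecoveryDescriptor,
    components: Mapping[str, RewardFn],
    weights: Optional[Mapping[str, float]] = None,
) -> RewardFn:
    """Build a reward function from the components the mask selects."""
    w = dict(weights or {})
    active = [n for n in components if descriptor.reward_mask.get(n, False)]

    def reward(state: State) -> float:
        # Weighted sum over active components; unlisted weights default to 1.
        return sum((w.get(n, 1.0) * components[n](state) for n in active), 0.0)

    return reward


# Illustrative component library (names and state keys are assumptions).
components = {
    "reach": lambda s: -s["gripper_to_object"],
    "grasp": lambda s: 1.0 if s["grasped"] else 0.0,
    "lift": lambda s: s["object_height"],
}

descriptor = RecoveryDescriptor(
    failure_mode="dropped_object",
    stage="regrasp",
    reward_mask={"reach": True, "grasp": True, "lift": False},
)
reward_fn = compile_reward(descriptor, components)
r = reward_fn({"gripper_to_object": 0.2, "grasped": 0.0, "object_height": 0.05})
```

In this sketch the masked-out `lift` term never contributes, so the residual policy is rewarded only for behavior relevant to the inferred recovery stage.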
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
Failure Recovery
Robotic Manipulation
Reward Design
Sim-to-Real Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-Guided Reward Compilation
Failure Recovery
Vision-Language-Action Policies
Residual Policy
Zero-Shot Sim-to-Real