🤖 AI Summary
This work addresses how to use Q-function gradients efficiently to optimize multi-step denoising actions when fine-tuning flow-matching vision-language-action (VLA) policies. The authors propose Q-Guided Value-Gradient Matching (Q-VGM), which adopts the VGG-Flow perspective to reinterpret the value gradient as a gradient field over denoising time and incorporates it into the flow alignment process. This design avoids both end-to-end backpropagation through the denoising chain and explicit action-likelihood computation. Combined with a Cal-QL ensemble Q-network built on compact RLT features with layer-wise action injection, Q-VGM improves the policy without additional expert supervision while operating on a fixed replay buffer. Evaluated on LIBERO, RoboTwin 2.0, and real-world robotic tasks, the method raises success rates from 75.0% to 92.5%, from 76.4% to 87.2%, and from 40.0% to 67.5%, respectively, outperforming same-architecture baselines in all three settings.
📝 Abstract
We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies: directly backpropagating the value through their multi-step denoising process is numerically unstable at VLA scale, and the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that turns the value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. The method requires no action likelihoods and no backpropagation through the denoising chain, and it operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical paradigm of few-shot initialization followed by learning from experience: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.
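The abstract does not state the training objective, so the following is only a toy sketch of what matching a denoising-time value-gradient field could look like, not the authors' implementation. It assumes a straight-line flow between noise and action, a quadratic stand-in critic with an analytic gradient, a one-step estimate of the final action, and a hypothetical guidance scale `lam`; all function names are illustrative. The point it shows is that the critic gradient enters through a per-time regression target, so no gradient flows through the denoising chain and no action likelihood is needed.

```python
import numpy as np

def q_value(action, goal):
    """Toy critic Q(a) = -0.5 * ||a - goal||^2 (assumed stand-in for a learned Q-network)."""
    return -0.5 * np.sum((action - goal) ** 2, axis=-1)

def q_grad(action, goal):
    """Analytic dQ/da for the toy critic above."""
    return goal - action

def matching_target(x_t, t, v_ref, goal, lam=0.5):
    """Regression target for the fine-tuned velocity at denoising time t (assumed form)."""
    # One-step estimate of the final action under the straight-line flow.
    a_hat = x_t + (1.0 - t)[:, None] * v_ref
    # Reference velocity plus a scaled critic gradient; used as a fixed target.
    return v_ref + lam * q_grad(a_hat, goal)

def matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
batch, dim = 64, 4
goal = np.ones(dim)
x0 = rng.normal(size=(batch, dim))      # noise samples
a_buf = rng.normal(size=(batch, dim))   # actions drawn from a fixed replay buffer
t = rng.uniform(size=batch)             # denoising times in [0, 1)
x_t = (1.0 - t)[:, None] * x0 + t[:, None] * a_buf
v_ref = a_buf - x0                      # reference velocity of the straight-line flow

v_target = matching_target(x_t, t, v_ref, goal)
a_ref = x_t + (1.0 - t)[:, None] * v_ref     # action reached by the reference velocity
a_new = x_t + (1.0 - t)[:, None] * v_target  # action reached by the guided velocity
```

In this toy setting, `a_ref` recovers the buffer action, while `a_new` is shifted toward higher critic value. A real policy would replace `v_pred` with the output of the action expert and minimize `matching_loss` at sampled denoising times.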