Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the mixed-quality experience produced by deployed vision-language-action (VLA) policies, which is hard to use with standard imitation learning: behavior cloning replicates failures, filtered behavior cloning discards useful sub-trajectories, and offline reinforcement learning adds a large external critic. The authors propose ForesightFlow, a self-guided flow-matching policy that jointly models action chunks and success-potential trajectories, so the same model generates and scores action candidates without an external critic, enabling best-of-K inference. Key innovations include decoupled advantage weighting (exponentiated advantage weights are applied only to action velocities, while potential velocities are trained uniformly, to avoid value hallucination), a one-step boundary estimator for conditional flow matching that computes advantages with a single stop-gradient forward pass, and self-guided sampling. Evaluated on five BEHAVIOR-1K simulation tasks and five real-world bimanual robot tasks, ForesightFlow outperforms imitation learning baselines, matches the simulation success of the strongest separate-critic baseline, improves real-world success, and reduces training compute by 38%.
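
The best-of-K inference step mentioned in the summary can be illustrated with a minimal sketch. It assumes that K candidate action chunks and their success-potential trajectories have already been sampled from the policy, and that a candidate's score is the last value of its potential trajectory. The source does not say how the trajectory is reduced to a score, so that rule, the function name `select_best_of_k`, and the array shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def select_best_of_k(action_chunks, potentials):
    """Return the index and action chunk of the highest-scoring candidate.

    action_chunks: array of shape (K, H, A), K candidate chunks of horizon H.
    potentials:    array of shape (K, H), one potential trajectory per candidate.
    """
    action_chunks = np.asarray(action_chunks)
    potentials = np.asarray(potentials)
    if action_chunks.shape[0] != potentials.shape[0]:
        raise ValueError("need one potential trajectory per candidate")

    # Assumption: score each candidate by its terminal potential value.
    scores = potentials[:, -1]
    best = int(np.argmax(scores))
    return best, action_chunks[best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunks = rng.normal(size=(4, 8, 7))        # K=4 candidates, H=8 steps, A=7 dims
    pots = rng.uniform(size=(4, 8))            # stand-in potential trajectories
    idx, chunk = select_best_of_k(chunks, pots)
    print("selected candidate", idx, "with chunk shape", chunk.shape)
```

Because the same model produces both arrays, no separate critic network is evaluated at selection time; only the reduction from trajectory to scalar score is left open here.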

📝 Abstract

Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience (successful demonstrations, partial completions, recoverable mistakes, and failures) that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.
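
The decoupling described in the abstract can be sketched as a loss in which exponentiated advantage weights scale only the action-velocity error, while the potential-velocity error is averaged uniformly. This is a hedged illustration, not the paper's code: the function name `decoupled_awfm_loss`, the temperature `beta`, the weight cap `max_weight`, the squared-error form, and the array shapes are all assumptions. Advantages are treated as fixed inputs, which mirrors the stop-gradient mentioned in the abstract.

```python
import numpy as np


def decoupled_awfm_loss(pred_action_vel, target_action_vel,
                        pred_potential_vel, target_potential_vel,
                        advantages, beta=1.0, max_weight=20.0):
    """Return (total, action_loss, potential_loss) for one batch.

    Action velocities:    shape (B, H, A).
    Potential velocities: shape (B, H).
    advantages:           shape (B,), treated as constants.
    """
    # Per-sample mean squared velocity errors.
    action_err = np.mean((pred_action_vel - target_action_vel) ** 2, axis=(1, 2))
    potential_err = np.mean((pred_potential_vel - target_potential_vel) ** 2, axis=1)

    # Exponentiated advantage weights; the cap is an assumed safeguard
    # against overflow and is not taken from the source.
    weights = np.minimum(np.exp(np.asarray(advantages) / beta), max_weight)

    action_loss = float(np.mean(weights * action_err))  # weighted: policy improvement
    potential_loss = float(np.mean(potential_err))      # uniform: value calibration
    return action_loss + potential_loss, action_loss, potential_loss


if __name__ == "__main__":
    B, H, A = 2, 3, 2
    zeros_a, ones_a = np.zeros((B, H, A)), np.ones((B, H, A))
    zeros_p, ones_p = np.zeros((B, H)), np.ones((B, H))
    adv = np.array([np.log(2.0), 0.0])
    print(decoupled_awfm_loss(zeros_a, ones_a, zeros_p, ones_p, adv))
```

In this sketch, failed trajectories with low advantage contribute little to the action term but count fully in the potential term, which is the property the abstract credits with avoiding overconfident scores.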

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

policy improvement

mixed-quality experience

imitation learning

offline reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching

vision-language-action

self-guided policy