🤖 AI Summary
Flow-based vision-language-action (VLA) models (e.g., $\pi_0$, $\pi_{0.5}$) face challenges in large-scale reinforcement learning (RL), including intractable exact action log-likelihood computation and heavy reliance on annotated data. Method: We propose Flow-Noise and Flow-SDE, two novel approaches that respectively enable differentiable, exact log-likelihood estimation during iterative denoising and construct an efficient stochastic dynamical framework via ODE-to-SDE conversion. Integrated with learnable noise networks and distributed RL algorithms, our methods support multi-task online training and large-scale parallel simulation. Contribution/Results: On the LIBERO and ManiSkill benchmarks, $\pi_0$ and $\pi_{0.5}$ achieve task accuracies of 98.3% and 85.7%, significantly surpassing supervised fine-tuning baselines. Our work marks the first end-to-end, scalable, high-accuracy RL optimization grounded in flow-based generative modeling.
📄 Abstract
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising.
We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) **Flow-Noise** models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) **Flow-SDE** integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration.
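To make the Flow-Noise idea concrete, here is a minimal sketch of how treating each Euler denoising step as a Gaussian transition yields an exact, summable log-likelihood for RL. All names here (`denoise_with_logprob`, `velocity_fn`, `noise_fn`) are hypothetical illustrations, not the paper's actual API; the velocity network stands in for the flow model and the noise function for the learnable noise network.

```python
import numpy as np

def denoise_with_logprob(a0, velocity_fn, noise_fn, n_steps=10, rng=None):
    """Illustrative Euler denoising loop viewed as a discrete-time MDP.

    Hypothetical sketch: `velocity_fn(a, t)` plays the role of the flow
    model's velocity prediction and `noise_fn(a, t)` the per-step standard
    deviation from a learnable noise network. Because each transition is
    Gaussian, its log-probability is exact and the per-step terms sum to
    the log-likelihood of the whole denoising trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    a, logp, dt = a0, 0.0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        mean = a + velocity_fn(a, t) * dt        # deterministic Euler step
        std = noise_fn(a, t)                     # learned injected noise scale
        a = mean + std * rng.standard_normal(a.shape)
        # Exact Gaussian log-density of the sampled transition.
        logp += (-0.5 * ((a - mean) / std) ** 2
                 - np.log(std) - 0.5 * np.log(2 * np.pi)).sum()
    return a, logp
```

In an RL setting, `logp` would serve as the policy log-likelihood of the emitted action chunk, e.g. for importance ratios in a PPO-style objective; the design point is that injecting noise at every step turns an otherwise deterministic ODE sampler into a stochastic policy with a tractable density.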
We evaluate $\pi_{\text{RL}}$ on the LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. On ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multi-task RL under heterogeneous simulation.
Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.