Aligning Flow Map Policies with Optimal Q-Guidance

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high inference latency that multi-step sampling in diffusion and flow matching policies imposes on sequential decision-making. The authors propose flow map policies (FMP), which accelerate action generation by learning arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. For offline-to-online reinforcement learning, online adaptation is formulated as a trust-region optimization problem that improves the critic's Q-value while staying close to the offline policy. Key contributions include the flow map policy formulation; Flow Map Q-Guidance (FMQ), a closed-form learning target that is optimal under the critic-guided trust-region constraint; and Q-Guided Beam Search (QGBS), a stochastic sampler that combines renoising with beam search for iterative inference-time refinement. Evaluated on 12 tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, a 21.3% relative improvement in average success rate over the previous one-step policy MVP.
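To make the latency argument concrete, the sketch below contrasts multi-step Euler sampling with a single flow map jump. It is only an illustration under stated assumptions: the linear velocity field, the constants `K` and `TARGET`, and the function names are hypothetical and were chosen because this toy ODE has a closed-form flow map. In the paper the flow map is learned, not derived analytically.

```python
import numpy as np

K = 3.0                         # hypothetical contraction rate of the toy dynamics
TARGET = np.array([0.5, -0.2])  # hypothetical action the toy flow moves toward


def velocity(x, t):
    """Toy velocity field dx/dt = K * (TARGET - x); t is unused in this toy."""
    return K * (TARGET - x)


def euler_sample(x0, n_steps):
    """Multi-step sampling: n_steps velocity evaluations from t=0 to t=1."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def flow_map(x, s, t):
    """Exact flow map X_{s,t} of the toy ODE: one call jumps from time s to t."""
    x = np.asarray(x, dtype=float)
    return TARGET + (x - TARGET) * np.exp(-K * (t - s))


rng = np.random.default_rng(0)
noise = rng.standard_normal(2)

one_jump = flow_map(noise, 0.0, 1.0)                        # one evaluation
two_jumps = flow_map(flow_map(noise, 0.0, 0.4), 0.4, 1.0)   # arbitrary-size jumps compose
many_steps = euler_sample(noise, 2000)                      # 2000 evaluations
```

In this toy setting a single jump matches the composition of two shorter jumps, while Euler integration only approaches the same endpoint as the number of velocity evaluations grows.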

📝 Abstract

Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps, including one-step jumps, across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive Flow Map Q-Guidance (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-Guided Beam Search (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.
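The abstract does not spell out how QGBS works, so the following is only a toy sketch of the general idea of combining renoising with beam search under a critic. Everything in it is an assumption made for illustration: the Gaussian toy policy (`MEAN`, `SIGMA`), the quadratic critic `toy_q`, the linear renoising rule, the beam sizes, and the choice to keep parents in the candidate pool. It should not be read as the authors' algorithm.

```python
import numpy as np

MEAN = np.array([0.0, 0.0])    # hypothetical mean of the toy policy's actions
SIGMA = 0.5                    # hypothetical std of the toy policy's actions
BEST = np.array([0.6, -0.3])   # hypothetical maximiser of the toy critic


def flow_map_to_action(x, s):
    """Toy flow map X_{s,1}: exact for a linear flow N(0, I) -> N(MEAN, SIGMA^2 I)."""
    return MEAN + SIGMA * (x - s * MEAN) / (1.0 - s + s * SIGMA)


def toy_q(actions):
    """Hypothetical critic: higher is better, peaked at BEST."""
    return -np.sum((actions - BEST) ** 2, axis=-1)


def renoise(actions, s, rng):
    """Move actions back to flow time s by mixing in fresh Gaussian noise."""
    eps = rng.standard_normal(actions.shape)
    return s * actions + (1.0 - s) * eps


def q_guided_beam_search(rng, beam=4, children=8, rounds=5, s=0.7):
    # Initial beams: one-step jumps from pure noise (time 0) to actions (time 1).
    beams = flow_map_to_action(rng.standard_normal((beam, 2)), 0.0)
    history = [toy_q(beams).max()]
    for _ in range(rounds):
        # Expand every beam into several renoised-and-regenerated children.
        parents = np.repeat(beams, children, axis=0)
        kids = flow_map_to_action(renoise(parents, s, rng), s)
        # Keep parents in the pool so the best Q-value never decreases.
        pool = np.concatenate([beams, kids], axis=0)
        beams = pool[np.argsort(toy_q(pool))[-beam:]]
        history.append(toy_q(beams).max())
    return beams[np.argmax(toy_q(beams))], history


best_action, q_trace = q_guided_beam_search(np.random.default_rng(0))
```

Each refinement round costs `beam * children` flow map evaluations, so in this sketch the extra inference-time compute is an explicit knob that can be set to zero rounds to recover plain one-step sampling.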

Problem

Research questions and friction points this paper is trying to address.

generative policies

inference latency

flow matching

sequential decision-making

multimodal action distributions

Innovation

Methods, ideas, or system contributions that make the work stand out.

flow map policies

offline-to-online reinforcement learning

Q-guidance