🤖 AI Summary
Existing GRPO-based alignment methods for flow models suffer from temporally sparse reward signals: because group rollouts are expensive to generate, training uses very few denoising steps, so preference feedback reaches only a few timesteps along the trajectory, which hinders fine-grained alignment. To address this limitation, this work proposes a zero-cost temporal expansion mechanism that decomposes each coarse denoising step into multiple finer sub-trajectories through a principled decomposition of average velocity. This propagates reward signals to a denser set of intermediate timesteps without additional generation overhead. By expanding the effective optimization scope and improving alignment granularity, the proposed approach improves preference alignment across diverse reward settings.
📝 Abstract
Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models makes generating group rollouts for policy-gradient updates expensive, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback reaches only a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we keep efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages, giving more comprehensive preference alignment without additional generation cost. The design offers two benefits: (i) zero-cost horizon expansion: by directly reusing piecewise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under a fixed sampling budget; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments show that Pave-GRPO improves preference alignment across different reward settings.
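The idea of splitting one coarse transition into an equivalent set of finer ones can be illustrated with a general identity: the average velocity over an interval equals the length-weighted mean of the average velocities over any partition of that interval. The Python sketch below checks this numerically. It is a toy illustration and not the Pave-GRPO implementation; `toy_velocity`, `average_velocity`, and `decompose` are hypothetical names, and the scalar velocity field is an assumption chosen only so the result can be verified by hand.

```python
from typing import Callable, List, Tuple


def toy_velocity(t: float) -> float:
    """Hypothetical scalar velocity field (assumption, for illustration only)."""
    return 3.0 * t * t + 1.0


def average_velocity(v: Callable[[float], float], t0: float, t1: float,
                     n: int = 1000) -> float:
    """Average of v over [t0, t1], approximated with the midpoint rule."""
    h = (t1 - t0) / n
    return sum(v(t0 + (i + 0.5) * h) for i in range(n)) / n


def decompose(v: Callable[[float], float], t_start: float, t_end: float,
              k: int) -> List[Tuple[float, float]]:
    """Split [t_start, t_end] into k equal sub-intervals.

    Returns one (weight, sub-interval average velocity) pair per sub-interval,
    where weight is the fraction of the coarse interval it covers.
    """
    edges = [t_start + (t_end - t_start) * i / k for i in range(k + 1)]
    total = t_end - t_start
    return [((b - a) / total, average_velocity(v, a, b))
            for a, b in zip(edges[:-1], edges[1:])]


# One coarse step versus the same step expressed as four finer pieces.
coarse = average_velocity(toy_velocity, 0.0, 0.5)
pieces = decompose(toy_velocity, 0.0, 0.5, k=4)
recombined = sum(w * u for w, u in pieces)

print(f"coarse average velocity:   {coarse:.6f}")
print(f"recombined from 4 pieces:  {recombined:.6f}")
```

Because the two quantities agree, a target defined on the coarse step can be rewritten in terms of the finer sub-intervals without drawing new samples. How Pave-GRPO assigns rewards to each piece is specified in the paper itself, not in this sketch.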