Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability and catastrophic forgetting that arise when flow-matching models are fine-tuned with ratio-clipping policy optimization, where the probability ratio is only a noisy single-sample estimate of the true policy divergence. Building on prior work that formulates the denoising process as a Markov decision process, the authors observe that the per-step policy is Gaussian, so the KL divergence between old and new policies can be computed exactly and cheaply. This KL divergence replaces the probability-ratio clipping of Proximal Policy Optimization (PPO) and is combined with an asymmetric divergence mask that blocks a gradient update only when it both moves away from the trusted region and exceeds the divergence threshold. The reported results are higher rewards with better KL-proximal efficiency on image and video generation, less catastrophic forgetting, more balanced multi-objective optimization, and stable multi-epoch training in settings where ratio clipping degrades.
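For reference, the closed form that makes an exact KL computation possible for Gaussian policies can be written as below. This is the standard identity for isotropic Gaussians in $d$ dimensions. The symbols $\mu_{\text{old}}$, $\mu_{\text{new}}$, $\sigma_{\text{old}}$, $\sigma_{\text{new}}$, and the direction of the divergence are assumptions made for illustration; the summary does not state the paper's exact parameterization.

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_{\text{old}}, \sigma_{\text{old}}^{2} I)\,\middle\|\,\mathcal{N}(\mu_{\text{new}}, \sigma_{\text{new}}^{2} I)\right)
&= d \log\frac{\sigma_{\text{new}}}{\sigma_{\text{old}}}
 + \frac{d\,\sigma_{\text{old}}^{2} + \lVert \mu_{\text{old}} - \mu_{\text{new}} \rVert^{2}}{2\,\sigma_{\text{new}}^{2}}
 - \frac{d}{2}, \\[4pt]
\text{and if } \sigma_{\text{old}} = \sigma_{\text{new}} = \sigma: \quad
D_{\mathrm{KL}} &= \frac{\lVert \mu_{\text{old}} - \mu_{\text{new}} \rVert^{2}}{2\,\sigma^{2}}.
\end{aligned}
```

When both policies share the same noise scale, the divergence depends only on the squared distance between the two predicted means, so no sampling is needed to evaluate it.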

📝 Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
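The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the shared-standard-deviation assumption, the reading of "moving away from the trusted region" as "positive advantage with ratio above 1, or negative advantage with ratio below 1", and the threshold name `delta` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """Exact KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I).

    Assumes both policies share the same scalar std `sigma` (an assumption
    of this sketch). Returns one KL value per batch element.
    """
    diff = (mu_old - mu_new).reshape(mu_old.shape[0], -1)
    return (diff ** 2).sum(axis=1) / (2.0 * sigma ** 2)


def asymmetric_divergence_mask(ratio, advantage, kl, delta):
    """Return 0 where an update is blocked and 1 elsewhere.

    An update is blocked only if BOTH conditions hold:
      1. it pushes the policy further from the old one
         (advantage > 0 with ratio > 1, or advantage < 0 with ratio < 1);
      2. the exact KL already exceeds the threshold `delta`.
    """
    moving_away = ((advantage > 0) & (ratio > 1.0)) | (
        (advantage < 0) & (ratio < 1.0)
    )
    blocked = moving_away & (kl > delta)
    return np.where(blocked, 0.0, 1.0)


def masked_surrogate_loss(ratio, advantage, kl, delta):
    """Policy-gradient surrogate with the divergence mask in place of clipping."""
    mask = asymmetric_divergence_mask(ratio, advantage, kl, delta)
    return -(mask * ratio * advantage).mean()


if __name__ == "__main__":
    # Toy batch: two samples with 3-dimensional per-step means.
    mu_old = np.zeros((2, 3))
    mu_new = np.ones((2, 3))
    print(gaussian_kl_shared_std(mu_old, mu_new, sigma=1.0))

    ratio = np.array([1.2, 0.8, 1.2, 0.8])
    advantage = np.array([1.0, 1.0, -1.0, -1.0])
    kl = np.ones(4)
    print(asymmetric_divergence_mask(ratio, advantage, kl, delta=0.5))
```

In this sketch, updates that would pull the policy back toward the old one are never masked, even when the KL is above `delta`. That is the asymmetry: only updates that increase an already excessive divergence are blocked.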

Problem

Research questions and friction points this paper is trying to address.

flow matching

reinforcement learning

policy optimization

ratio clipping

trust region

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow Matching

Proximal Policy Optimization

KL Divergence