🤖 AI Summary
Discrete diffusion models face challenges in applying policy-gradient reinforcement fine-tuning under non-differentiable, structured rewards. To address this, we propose Score Entropy Policy Optimization (SEPO), the first framework to systematically integrate policy gradients into discrete diffusion modeling. SEPO constructs a differentiable surrogate objective via score matching and adds entropy regularization to stabilize training, yielding theoretical convergence guarantees and empirical robustness. Crucially, it does not require the reward function to be differentiable and supports arbitrary black-box rewards, significantly improving both generation quality and sampling efficiency. Experiments across multiple discrete text-generation tasks show that SEPO consistently outperforms state-of-the-art baselines, including DDPO and REINFORCE variants, while exhibiting strong scalability. The implementation is publicly available.
📝 Abstract
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains challenging. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
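To make the core idea concrete, here is a minimal sketch of the general mechanism the abstract describes: a policy-gradient (REINFORCE-style) update against a reward that can only be queried, never differentiated, combined with an entropy bonus for stability. The toy single-token categorical policy, the vocabulary size, and the reward function below are illustrative assumptions, not the paper's actual score-entropy objective over discrete diffusion models.

```python
import numpy as np

# Sketch of policy-gradient fine-tuning with a NON-differentiable
# (black-box) reward plus an entropy regularizer. The tiny categorical
# "policy" is a stand-in for a discrete diffusion model (assumption).

rng = np.random.default_rng(0)
V = 5                  # toy vocabulary size (assumption)
logits = np.zeros(V)   # policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def black_box_reward(token):
    # Black-box reward: we can only query it, never backprop through
    # it. Here (arbitrarily): reward 1.0 iff the sampled token is 0.
    return 1.0 if token == 0 else 0.0

lr, ent_coef = 0.5, 0.01
for _ in range(200):
    p = softmax(logits)
    token = rng.choice(V, p=p)
    r = black_box_reward(token)

    # REINFORCE estimator: gradient of log p(token) w.r.t. the logits
    grad_logp = -p.copy()
    grad_logp[token] += 1.0

    # Gradient of the entropy H(p) = -sum_i p_i log p_i w.r.t. logits
    g = -p * (np.log(p) + 1.0)
    grad_ent = g - p * g.sum()

    # Ascend reward plus entropy bonus; no reward gradient is needed
    logits += lr * (r * grad_logp + ent_coef * grad_ent)
```

After a few hundred updates the policy concentrates most of its probability mass on the rewarded token, even though the reward was only ever queried as a black box; the entropy term keeps the update from collapsing too aggressively.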