🤖 AI Summary
Discrete diffusion models face challenges in applying policy-gradient reinforcement fine-tuning under non-differentiable, structured rewards. To address this, we propose Score Entropy Policy Optimization (SEPO), the first framework to systematically integrate policy gradients into discrete diffusion modeling. SEPO constructs a differentiable surrogate objective via score matching and adds entropy regularization to stabilize training, yielding theoretical convergence guarantees and empirical robustness. Crucially, it does not require the reward function to be differentiable and supports arbitrary black-box rewards, significantly improving both generation quality and sampling efficiency. Experiments across multiple discrete text-generation tasks show that SEPO consistently outperforms state-of-the-art baselines, including DDPO and REINFORCE variants, while exhibiting strong scalability. The implementation is publicly available.
📝 Abstract
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains challenging. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
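To make the core idea concrete, here is a minimal sketch of the general mechanism the abstract describes: a policy-gradient (REINFORCE-style) update against a reward that can only be queried, never differentiated, combined with an entropy bonus for stability. The toy single-token categorical policy, the vocabulary size, and the reward function below are illustrative assumptions, not the paper's actual score-entropy objective over discrete diffusion models.

```python
import numpy as np

# Sketch of policy-gradient fine-tuning with a NON-differentiable
# (black-box) reward plus an entropy regularizer. The tiny categorical
# "policy" is a stand-in for a discrete diffusion model (assumption).

rng = np.random.default_rng(0)
V = 5                  # toy vocabulary size (assumption)
logits = np.zeros(V)   # policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def black_box_reward(token):
    # Black-box reward: we can only query it, never backprop through
    # it. Here (arbitrarily): reward 1.0 iff the sampled token is 0.
    return 1.0 if token == 0 else 0.0

lr, ent_coef = 0.5, 0.01
for _ in range(200):
    p = softmax(logits)
    token = rng.choice(V, p=p)
    r = black_box_reward(token)

    # REINFORCE estimator: gradient of log p(token) w.r.t. the logits
    grad_logp = -p.copy()
    grad_logp[token] += 1.0

    # Gradient of the entropy H(p) = -sum_i p_i log p_i w.r.t. logits
    g = -p * (np.log(p) + 1.0)
    grad_ent = g - p * g.sum()

    # Ascend reward plus entropy bonus; no reward gradient is needed
    logits += lr * (r * grad_logp + ent_coef * grad_ent)
```

After a few hundred updates the policy concentrates most of its probability mass on the rewarded token, even though the reward was only ever queried as a black box; the entropy term keeps the update from collapsing too aggressively.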