Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Discrete diffusion models face challenges in applying policy gradient–based reinforcement fine-tuning under non-differentiable, structured rewards. To address this, we propose Score Entropy Policy Optimization (SEPO), the first framework to systematically integrate policy gradients into discrete diffusion modeling. SEPO constructs a differentiable surrogate objective via score matching and incorporates entropy regularization to ensure stable training—yielding theoretically guaranteed convergence and empirical robustness. Crucially, it requires no differentiability of the reward function and supports arbitrary black-box rewards, significantly improving both generation quality and sampling efficiency. Extensive experiments across multiple discrete text generation tasks demonstrate that SEPO consistently outperforms state-of-the-art baselines—including DDPO and REINFORCE variants—while exhibiting strong scalability. The implementation is publicly available.
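The summary above describes the general setting: a policy-gradient update against a black-box, non-differentiable reward, stabilized with entropy regularization. A minimal, hypothetical REINFORCE-style sketch of that setting (this is an illustration, not the paper's SEPO algorithm; the per-position categorical policy and the toy reward are invented for the example):

```python
# Hypothetical REINFORCE-style sketch: a per-position categorical policy
# is fine-tuned against a black-box, non-differentiable reward, with an
# entropy bonus for stability. Illustrative only -- not SEPO itself.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ_LEN = 5, 4
logits = np.zeros((SEQ_LEN, VOCAB))  # independent categorical per position

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def black_box_reward(seq):
    # Structured, non-differentiable toy reward: number of 0-tokens.
    return float(np.sum(seq == 0))

def policy_gradient_step(logits, lr=0.5, ent_coef=0.01, batch=64):
    probs = softmax(logits)
    logp = np.log(probs + 1e-12)
    grad = np.zeros_like(logits)
    rewards = []
    for _ in range(batch):
        seq = np.array([rng.choice(VOCAB, p=probs[t]) for t in range(SEQ_LEN)])
        rewards.append(black_box_reward(seq))
        # REINFORCE: grad of log pi(seq) w.r.t. logits = one_hot(seq) - probs
        g = -probs.copy()
        g[np.arange(SEQ_LEN), seq] += 1.0
        grad += rewards[-1] * g
    grad /= batch
    # Gradient of the entropy bonus H = -sum(p * log p) w.r.t. the logits.
    ent_grad = probs * ((probs * logp).sum(axis=-1, keepdims=True) - logp)
    logits += lr * (grad + ent_coef * ent_grad)  # ascend reward + entropy
    return float(np.mean(rewards))

initial = policy_gradient_step(logits)
for _ in range(59):
    final = policy_gradient_step(logits)
```

Because the reward is only ever evaluated on sampled sequences, nothing here requires it to be differentiable; the entropy term keeps the policy from collapsing too quickly onto a single sequence.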

📝 Abstract
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
Problem

Research questions and friction points this paper is trying to address.

Discrete Diffusion Models
Policy Gradient Methods
Language Modeling over Complex Discrete Structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

SEPO Algorithm
Discrete Diffusion Models
Reinforcement Learning from Human Feedback (RLHF)
Oussama Zekri
ENS Paris-Saclay
Machine Learning · Generative Models
Nicolas Boullé
Department of Mathematics, Imperial College London, UK