Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses inefficient gradient updates in reinforcement learning for diffusion large language models (dLLMs), which the authors attribute to two misalignments: process-reward misalignment and state-trajectory misalignment. They propose Process Aligned Policy Optimization (PAPO), built on two components: Step-Aware Process Rewards (SPR) and Entropy-Guided Historical Re-enactment (EHR). SPR converts sparse terminal rewards into dense, step-wise credit, while EHR replays the model's authentic generation trajectories at high-uncertainty steps. Together, these align reinforcement learning updates with the generated trajectories. Across four reasoning benchmarks (GSM8K, MATH500, Countdown, and Sudoku), PAPO reports gains over baselines of up to 4.5%, 4.8%, 42.2%, and 16.1%, respectively.

📝 Abstract

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between the authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown, and 16.1% on Sudoku.
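The abstract does not say how EHR identifies "high-uncertainty steps", so the following is only a minimal sketch of one plausible reading: score each generation step by the mean Shannon entropy of its token distributions and keep the top-k steps. The function names, the mean-entropy scoring rule, the top-k selection, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math

def mean_token_entropy(step_probs):
    """Average Shannon entropy (in nats) over the token distributions of one step."""
    total = 0.0
    for dist in step_probs:
        # Terms with p == 0 contribute nothing to the entropy, so skip them.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(step_probs)

def select_high_uncertainty_steps(trajectory_probs, k):
    """Return (indices of the k highest-entropy steps in step order, all entropies).

    Hypothetical selection rule: the source only says that replay happens
    at high-uncertainty steps, not how those steps are chosen.
    """
    entropies = [mean_token_entropy(step) for step in trajectory_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k]), entropies

# Toy trajectory (made-up numbers): 3 generation steps, each with
# 2 token positions over a 2-word vocabulary.
toy_trajectory = [
    [[0.5, 0.5], [0.5, 0.5]],    # very uncertain
    [[0.9, 0.1], [0.8, 0.2]],    # moderately certain
    [[1.0, 0.0], [0.99, 0.01]],  # nearly decided
]
selected, entropies = select_high_uncertainty_steps(toy_trajectory, k=1)
```

In this toy case the first step is selected, since a uniform distribution over two tokens has the maximum possible entropy of ln 2. A real system would compute these distributions from the model's logits at each denoising step.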

Problem

Research questions and friction points this paper is trying to address.

process-reward misalignment

state-trajectory misalignment

diffusion large language models

reinforcement learning

credit assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Aligned Policy Optimization

Step-Aware Process Rewards

Entropy-Guided Historical Re-enactment