Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps

📅 2025-09-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Masked diffusion language models (MDLMs) suffer from a fundamental training-inference misalignment: block-wise (chunked) decoding outperforms full diffusion-style decoding even though it is never used during training, and autoregressive reinforcement learning algorithms, applied directly, produce a mismatch between the rollout trajectory and the optimization trajectory because MDLM decoding is non-causal. Method: We propose a two-stage co-optimization framework: (1) EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) scheduling to improve the efficiency and stability of full diffusion-style decoding; (2) CJ-GRPO, a group-relative policy optimization algorithm with consistency-aware trajectory modeling, to eliminate skip-step optimization errors. Contribution/Results: Evaluated on LLaDA-8B-Instruct, the approach significantly improves the quality and efficiency of few-step generation on mathematical reasoning and planning tasks, achieving a superior trade-off between step count and performance.
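
To make the two decoding-side mechanisms concrete, here is a minimal PyTorch sketch of a full diffusion-style decoding loop with an ascending step-size schedule and early-step EOS suppression. It is an illustration under stated assumptions, not the authors' implementation (see the official repo for that): the HF-style `model(x).logits` interface, the names `mask_id` and `eos_id`, the linear schedule, and the fixed logit penalty are all placeholders.

```python
import torch
import torch.nn.functional as F

def ascending_step_sizes(seq_len: int, num_steps: int) -> list[int]:
    """Unmask few tokens in early steps and more in later steps.

    A simple linearly increasing schedule that sums to seq_len; the
    paper's exact ASS schedule may differ.
    """
    weights = torch.arange(1, num_steps + 1, dtype=torch.float)
    sizes = (weights / weights.sum() * seq_len).round().long()
    sizes[-1] += seq_len - sizes.sum()  # absorb rounding drift in the last step
    return sizes.tolist()

@torch.no_grad()
def decode(model, prompt_ids: torch.Tensor, gen_len: int, num_steps: int,
           mask_id: int, eos_id: int, eos_penalty: float = 1e9):
    # Start with the full generation span masked.
    x = torch.cat([prompt_ids,
                   torch.full((1, gen_len), mask_id, dtype=torch.long)], dim=1)
    gen_slice = slice(prompt_ids.shape[1], x.shape[1])

    for step, k in enumerate(ascending_step_sizes(gen_len, num_steps)):
        logits = model(x).logits[:, gen_slice, :]  # (1, gen_len, vocab)
        # EOSER (sketch): suppress EOS confidence until the final step so the
        # model cannot commit to ending the sequence before content is placed.
        if step < num_steps - 1:
            logits[..., eos_id] -= eos_penalty
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence
        still_masked = x[:, gen_slice] == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)  # only fill masked slots
        top = conf.topk(k, dim=-1).indices         # k most confident positions
        x[:, gen_slice].scatter_(1, top, pred.gather(1, top))
    return x
```

Suppressing EOS only before the final step still lets the model terminate sequences once content tokens are committed; the ascending schedule spends the few early steps on high-confidence anchor tokens and fills in the bulk later.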

๐Ÿ“ Abstract
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
Problem

Research questions and friction points this paper is trying to address.

Optimizing decoding strategies for masked diffusion language models
Addressing training-inference inconsistency in reinforcement learning
Reducing decoding steps while maintaining competitive performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

EOS Early Rejection (EOSER) prevents premature end-of-sequence commitment during few-step decoding
Ascending Step-Size (ASS) scheduler makes full diffusion-style decoding competitive with fewer steps
Consistency Trajectory GRPO (CJ-GRPO) aligns the optimization trajectory with the rollout trajectory (see the sketch below)
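
Below is a minimal sketch of the trajectory-consistency idea behind CJ-GRPO, under explicit assumptions: each rollout is stored as its sequence of intermediate masked states plus the (positions, tokens) unmasked at each step, so optimization walks the same trajectory as the rollout instead of skipping steps. The names (`rollouts`, `old_logps`, `group_rewards`) are hypothetical, and the PPO-style ratio is taken at the trajectory level for brevity; the paper's per-step formulation may differ.

```python
import torch

def trajectory_logprob(model, states, actions):
    """Sum log-probs of the tokens actually unmasked at each rollout step."""
    total = 0.0
    for x_t, (pos, tok) in zip(states, actions):
        logits = model(x_t).logits                # (1, L, vocab), HF-style API
        logp = torch.log_softmax(logits, dim=-1)
        total = total + logp[0, pos, tok].sum()   # tokens chosen at step t
    return total

def cj_grpo_loss(model, rollouts, group_rewards, old_logps,
                 clip_eps: float = 0.2):
    # Group-relative advantage: normalize rewards within the rollout group
    # (assumes group size > 1, as in GRPO).
    r = torch.tensor(group_rewards, dtype=torch.float)
    adv = (r - r.mean()) / (r.std() + 1e-6)

    losses = []
    for (states, actions), a, old_lp in zip(rollouts, adv, old_logps):
        # Re-score the SAME intermediate states seen during rollout, so the
        # optimized trajectory matches the rollout trajectory step for step.
        new_lp = trajectory_logprob(model, states, actions)
        ratio = torch.exp(new_lp - old_lp)
        # PPO-style clipped surrogate, as in GRPO.
        losses.append(-torch.min(ratio * a,
                                 torch.clamp(ratio, 1 - clip_eps,
                                             1 + clip_eps) * a))
    return torch.stack(losses).mean()
```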