🤖 AI Summary
Problem: Masked diffusion language models (MDLMs) suffer from a fundamental training-inference misalignment: chunked decoding outperforms full-diffusion decoding, and autoregressive reinforcement learning algorithms incur a trajectory mismatch between rollout and optimization due to non-causal decoding. Method: We propose a two-stage co-optimization framework: (1) EOS Early Rejection and Ascending Step-Size scheduling to improve the efficiency and stability of full-diffusion decoding; (2) CJ-GRPO, a group-relative policy optimization algorithm with consistency-aware trajectory modeling, to eliminate step-skipping optimization errors. Contribution/Results: Evaluated on LLaDA-8B-Instruct, our approach significantly improves the quality and efficiency of few-step generation on mathematical reasoning and planning tasks, achieving a superior trade-off between step count and performance.
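To make the "group-relative" part of CJ-GRPO concrete, here is a minimal sketch of the group-relative advantage computation shared by GRPO-style algorithms: for each prompt, a group of rollouts is sampled and every rollout's reward is normalized against the group's mean and standard deviation. The function name and the `eps` stabilizer are illustrative; CJ-GRPO's consistency-trajectory machinery is not shown here.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: list of scalar rewards, one per rollout of the same prompt.
    Returns (r - mean) / (std + eps) for each rollout.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative within the group, rollouts that beat their siblings on the same prompt are reinforced even when absolute rewards are uniformly low, which is what makes this baseline-free normalization attractive for reasoning tasks with sparse rewards.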
📝 Abstract
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored to MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
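The two decoding-side mechanisms named in the abstract can be sketched as follows, under assumptions of ours rather than the paper's exact formulation. The first function builds an ascending step-size schedule: with a fixed token budget `seq_len` and `num_steps` decoding steps, later steps unmask more tokens than earlier ones (here, a simple linear ramp). The second illustrates one possible realization of EOS early rejection, hard-masking the EOS logit during an early fraction of the decoding steps so the model cannot terminate prematurely; the paper's actual mechanism may instead apply a soft penalty, and `warmup_frac` is a hypothetical parameter.

```python
def ascending_step_sizes(seq_len, num_steps):
    """Per-step unmask counts that grow over time and sum to seq_len.

    Uses linearly increasing weights 1..num_steps, rounding each share and
    giving the remainder to the final (largest) step.
    """
    weights = list(range(1, num_steps + 1))
    total_w = sum(weights)
    sizes, assigned = [], 0
    for i, w in enumerate(weights):
        if i == num_steps - 1:
            sizes.append(seq_len - assigned)  # last step absorbs rounding error
        else:
            s = max(1, round(seq_len * w / total_w))
            sizes.append(s)
            assigned += s
    return sizes


def reject_eos_early(logits, step, num_steps, eos_id, warmup_frac=0.5):
    """Suppress the EOS token during the first warmup_frac of decoding steps.

    logits: per-token scores for one position (plain list for illustration).
    Returns a copy with the EOS logit set to -inf while step is "early".
    """
    logits = list(logits)
    if step < warmup_frac * num_steps:
        logits[eos_id] = float("-inf")
    return logits
```

For example, `ascending_step_sizes(64, 4)` yields a non-decreasing schedule summing to 64, so early steps commit to only a few high-confidence tokens while later steps fill in the rest in parallel.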