🤖 AI Summary
Diffusion language models (DLMs) suffer from “update forgetting”: uniform, context-agnostic token-level timestep updates overwrite early semantic edits, degrading text coherence and impeding cumulative refinement. This work formally characterizes the phenomenon and introduces Token Timestep Allocation (TTA), a soft semantic token-ordering mechanism that dynamically freezes semantically critical tokens while continuing to refine uncertain ones, enabling fine-grained, controllable inference-time text generation. The method supports both fixed and adaptive (task-signal-driven) token-level update schedules. Experiments demonstrate substantial improvements: in sentiment control, accuracy increases by over 20%, perplexity drops by 48%, and sampling steps decrease by 80%; in detoxification, maximum toxicity drops from 14.5 to 12.2 and perplexity from 32.0 to 26.0. To our knowledge, this is the first work to formalize and mitigate update forgetting in DLMs via semantics-aware token scheduling.
📝 Abstract
While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode, update forgetting, in which uniform, context-agnostic updates induce token-level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. Because this failure originates in uniform, context-agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft, semantic token ordering via per-token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep-based ordering can be instantiated either as a fixed policy or as an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, TTA applies uniformly across DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20% higher accuracy and nearly halves perplexity while using fewer than one fifth of the steps; on detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable, controllable diffusion text generation.
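To make the core idea concrete, here is a minimal sketch of what a *fixed* per-token timestep allocation policy could look like. This is an illustrative assumption, not the paper's actual implementation: the function name, the confidence-based freeze rule, and the uncertainty-proportional budget are all hypothetical choices consistent with the description above (critical tokens frozen early, uncertain tokens refined longer).

```python
import math

def allocate_timesteps(confidences, t_max=100, freeze_threshold=0.9):
    """Hypothetical fixed per-token timestep allocation policy.

    Tokens whose model confidence already exceeds the freeze threshold
    are treated as semantically settled and receive zero further
    denoising steps; uncertain tokens get a refinement budget that
    grows with their uncertainty, up to t_max steps.
    """
    budgets = []
    for c in confidences:
        if c >= freeze_threshold:
            budgets.append(0)  # freeze: no further updates for this token
        else:
            # More uncertainty -> more remaining refinement steps.
            budgets.append(math.ceil((1.0 - c) * t_max))
    return budgets

# Example: a confident token is frozen, uncertain tokens keep refining.
print(allocate_timesteps([0.95, 0.5, 0.25]))  # -> [0, 50, 75]
```

An adaptive variant, as described in the abstract, would recompute the per-token schedule during sampling from task signals (e.g. a sentiment or toxicity classifier) rather than fixing it in advance.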