🤖 AI Summary
Discrete diffusion models struggle to complete text segments of flexible length at arbitrary positions without explicit positional supervision. To address this, we propose DDOT, the first framework enabling end-to-end, variable-length, location-aware completion within the discrete diffusion paradigm, with no positional labels required. Its core innovation is a sample-level optimal transport mechanism that jointly models denoising of both token identities and relative positions, preserving sequential order while supporting variable-length generation. DDOT is fully compatible with pretrained text denoisers and requires no architectural modification, enabling plug-and-play integration. Experiments on the One-Billion-Word and Yelp completion benchmarks demonstrate that DDOT significantly outperforms naive discrete diffusion baselines, achieves performance on par with state-of-the-art non-autoregressive models, and simultaneously improves training efficiency and prompt flexibility.
📝 Abstract
Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce DDOT (Discrete Diffusion with Optimal Transport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.
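To make the order-preserving property of a sample-level OT position coupling concrete: for scalar (1D) positions, the optimal transport plan between two sets of positions is the monotone matching obtained by sorting both sides and pairing ranks, which is exactly what keeps relative token order intact. The sketch below is illustrative only, assuming this standard 1D OT fact; the function name and setup are hypothetical and not DDOT's actual implementation.

```python
import numpy as np

def ot_couple_positions(noisy_pos, target_pos):
    """Pair noisy positions with target positions via the 1D OT
    (monotone) coupling: sort both sides and match by rank.
    Returns pairing[i] = index of the target position coupled to
    noisy position i. Hypothetical sketch, not DDOT's actual code."""
    noisy_pos = np.asarray(noisy_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    order_noisy = np.argsort(noisy_pos)    # ranks of noisy positions
    order_target = np.argsort(target_pos)  # ranks of target positions
    pairing = np.empty(len(noisy_pos), dtype=int)
    # i-th smallest noisy position is coupled to i-th smallest target
    pairing[order_noisy] = order_target
    return pairing

# Toy check: tokens that start to the left of each other end up at
# targets that are also to the left of each other (order preserved).
noisy = [0.9, 0.1, 0.5]
target = [2.0, 4.0, 6.0]
pairing = ot_couple_positions(noisy, target)
mapped = [target[j] for j in pairing]  # target each noisy point moves toward
```

Because the coupling is monotone, denoising each position toward its coupled target can stretch or shrink the infilled span (supporting variable length) without ever swapping two tokens' relative order.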