🤖 AI Summary
This work investigates whether diffusion language models (DLMs) can perform on par with autoregressive language models (ALMs) on general-purpose language tasks when data, model size, and task diversity are scaled jointly. To this end, we propose *diffusive adaptation*, a framework that reprograms pretrained masked language models into DLMs and pairs the reprogramming with task-specific and instruction finetuning to improve generalization. Our experiments show that scaling DLMs consistently improves performance across downstream language tasks, that instruction finetuning elicits zero-shot and few-shot in-context learning driven by natural-language instructions, and that the resulting models show promise on more demanding abilities such as reasoning. These findings suggest that the diffusion paradigm is a viable alternative to the autoregressive paradigm for general-purpose language modeling, challenging the prevailing assumption that autoregression is inherently superior for language generation and understanding.
📝 Abstract
The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic models and the scalability of large language models. Despite this potential, it remains unclear whether diffusion language models can solve general language tasks as well as their autoregressive counterparts. This paper demonstrates that scaling diffusion models with respect to data, model size, and tasks can make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining, exploiting the intrinsic connection between masked language modeling and diffusion language modeling. We then reprogram the pretrained masked language models into diffusion language models via diffusive adaptation, in which we explore task-specific finetuning and instruction finetuning to unlock their versatility in solving general language tasks. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks. We further find that instruction finetuning can elicit zero-shot and few-shot in-context learning abilities that help tackle many unseen tasks by following natural language instructions, and that the resulting models show promise in advanced and challenging abilities such as reasoning.
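The connection between masked language modeling and diffusion mentioned above can be illustrated with a toy sketch. It assumes an absorbing-state forward process in which each token is independently replaced by a mask symbol with probability `t`, and a reverse process that reveals the most confident predictions over a fixed number of steps. The abstract does not specify the noise schedule or the decoding rule, so the names `MASK`, `corrupt`, `denoise`, and `predict` are hypothetical and this is not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def corrupt(tokens, t, rng):
    """Forward process (assumed): mask each token independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


def denoise(tokens, predict, steps):
    """Reverse process (assumed): reveal masked positions over at most `steps` rounds.

    `predict` stands in for a masked language model. It takes the current
    sequence and returns one (token, confidence) pair per position.
    """
    tokens = list(tokens)
    for step in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        guesses = predict(tokens)
        # Reveal a share of the masked positions so that none remain after the last round.
        n_reveal = len(masked) - (len(masked) * (step - 1)) // step
        most_confident = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in most_confident[:n_reveal]:
            tokens[i] = guesses[i][0]
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    target = ["diffusion", "models", "can", "follow", "instructions"]
    noisy = corrupt(target, 0.5, rng)

    def oracle(seq):
        # Toy stand-in for a trained model: always returns the target tokens.
        return [(target[i], 1.0) for i in range(len(seq))]

    print(noisy)
    print(denoise(noisy, oracle, steps=3))
```

At `t = 1` every token is masked, which makes the input resemble the fully noised state of a diffusion model, while intermediate values of `t` resemble ordinary masked language modeling inputs. This is one way to see why a pretrained masked language model is a natural starting point for diffusive adaptation.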