🤖 AI Summary
This work investigates the hierarchical decoding dynamics of Masked Diffusion Language Models (MDLMs) in graph-to-text generation, revealing a consistent pattern: entities are generated first, followed by relations and functional words, with structural words finalized last. The study identifies that supervised fine-tuning prematurely fixes structural words at sentence endings, often leading to information omission or hallucination. To address this, the authors propose λ-scaling—a training-agnostic structural decoding strategy—and introduce Graph-LLaDA, a novel architecture integrating a Graph Transformer encoder with the LLaDA decoding framework. Experimental results demonstrate that λ-scaling improves BLEU-4 by 9.4 points, and Graph-LLaDA substantially outperforms overfitted baselines in cross-dataset evaluations on LAGRANGE, exhibiting superior generalization capability.
📝 Abstract
We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.