🤖 AI Summary
This work addresses parallel generation in diffusion large language model inference, where the key bottleneck is deciding which masked tokens can be safely committed together. The authors propose Fast-dLLM++, a training-free extension of Fast-dLLM built on Fréchet profile decoding: it selects parallel commit sets from the full sorted confidence profile instead of relying solely on the weakest selected token’s confidence. The method brings heterogeneous confidence profiles into the parallel decoding decision, generalizes Fast-dLLM’s factor selector (recovering it exactly when the confidences are equal), and adds a provable “heterogeneity bonus” when the selected tokens have uneven confidences. Experiments with the LLaDA-8B model on GSM8K, MATH, HumanEval, and MBPP show up to 37% higher throughput at comparable accuracy.
📝 Abstract
Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory rests on a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose \textbf{Fast-dLLM++}, a training-free extension that introduces \emph{Fréchet profile decoding}: selecting parallel commit sets from the full sorted confidence profile rather than from a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable \emph{heterogeneity bonus} when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy--throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37\% higher throughput at comparable accuracy. Our anonymous code release is at https://github.com/Ringo-Star/FastdLLM_plusplus.
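The abstract does not state the selection rule itself, so the following is only an illustrative sketch of why a full confidence profile can admit more tokens than a weakest-token summary. It assumes each token's confidence can be read as a marginal probability of being correct and applies the classical Fréchet lower bound on a joint probability, $P(\cap_i A_i) \ge \max(0, 1 - \sum_i (1 - p_i))$. The function names and the `min_joint` threshold are hypothetical and are not taken from Fast-dLLM or Fast-dLLM++.

```python
def frechet_lower_bound(confidences):
    """Fréchet lower bound on the probability that every token is correct,
    computed from the full confidence profile."""
    return max(0.0, 1.0 - sum(1.0 - c for c in confidences))


def weakest_token_bound(confidences):
    """The same bound after pessimistically replacing every confidence
    with the weakest one in the set (homogeneous assumption)."""
    n = len(confidences)
    return max(0.0, 1.0 - n * (1.0 - min(confidences)))


def largest_safe_prefix(confidences, bound_fn, min_joint):
    """Size of the largest top-n set whose bound stays at or above min_joint.

    min_joint is a hypothetical safety threshold used for illustration only.
    """
    ranked = sorted(confidences, reverse=True)
    best = 0
    for n in range(1, len(ranked) + 1):
        # Both bounds can only shrink as the prefix grows, so the first
        # failure ends the search.
        if bound_fn(ranked[:n]) >= min_joint:
            best = n
        else:
            break
    return best


# Uneven profile: the profile-based bound admits one more token.
uneven = [0.99, 0.98, 0.97, 0.90, 0.60]
profile_n = largest_safe_prefix(uneven, frechet_lower_bound, 0.8)
weakest_n = largest_safe_prefix(uneven, weakest_token_bound, 0.8)

# Equal confidences: the two bounds coincide, so both pick the same set size.
equal = [0.95] * 5
profile_equal_n = largest_safe_prefix(equal, frechet_lower_bound, 0.82)
weakest_equal_n = largest_safe_prefix(equal, weakest_token_bound, 0.82)
```

Under these assumptions the uneven profile commits four tokens with the profile bound and three with the weakest-token bound, while the equal-confidence profile commits three tokens under both. This mirrors the behavior the abstract describes, without claiming to reproduce the authors' actual selector.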