AI Summary
This work addresses the conflation in existing parallel decoding research between algorithmic token utilization and actual system overhead, which obscures an accurate characterization of "near-zero additional latency" parallelism. The paper introduces the concept of Near-Free Parallelism (NFP), explicitly distinguishing algorithmic parallelism from system costs for the first time. By analyzing the behavior of dense feedforward networks, Mixture-of-Experts (MoE), and attention mechanisms under an idle-compute baseline, it reveals that NFP is jointly constrained by memory resource slack and kernel granularity. Leveraging hardware resource balancing and kernel-granularity-aware evaluation, the study establishes a predictive criterion for NFP boundaries, correcting traditional idle-compute intuition that can overestimate NFP by up to 23×. Empirical validation across diverse dense and MoE models in both diffusion and autoregressive decoding tasks confirms the accuracy of this framework, offering a reliable system-side budgeting foundation for parallel strategy selection and model-system co-design.
Abstract
Parallel decoding improves generation efficiency by processing multiple decode positions within a single decode forward pass, but reported speedups conflate algorithmic token utilization with the system cost of executing multiple positions. We isolate the system side by introducing Near-Free Parallelism (NFP), the maximum number of positions executable at near-free latency. Analyzing Dense FFNs, MoE FFNs, and Attention against an idle-compute baseline, we find that NFP is shaped not by memory-bound resource slack alone, but also by implementation-induced kernel-granularity slack. Based on these mechanisms, we establish a Near-Free Parallelism principle that predicts the NFP boundary from hardware balance and implementation granularity. Validation on representative Dense and MoE models, spanning both diffusion and autoregressive decoding, shows that the principle accurately predicts practical NFP boundaries and reveals that the standard idle-compute intuition can over-predict them by up to 23x. The principle thereby offers a system-side budget for parallelism selection and model-system co-design.
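
To make the idle-compute intuition concrete, here is a minimal toy sketch. It is not the paper's NFP principle. It assumes a simple roofline model for a single weight-streaming layer: latency is the larger of the time to read the weights once and the time to compute all positions, and the kernel pads the position count up to whole tiles. The function names, the `tile` and `tolerance` parameters, and the hardware numbers are all hypothetical, chosen only for illustration.

```python
import math


def idle_compute_estimate(weight_bytes, flops_per_position, peak_flops, mem_bw):
    """Roofline ridge point: positions that fit before compute time exceeds
    the time needed to stream the weights once (assumed toy model)."""
    hardware_balance = peak_flops / mem_bw                   # FLOPs per byte
    return hardware_balance * weight_bytes / flops_per_position


def layer_latency(n, weight_bytes, flops_per_position, peak_flops, mem_bw, tile=1):
    """Toy latency in seconds for one layer processing n positions."""
    mem_time = weight_bytes / mem_bw                         # weights read once
    padded = math.ceil(n / tile) * tile                      # kernel works in whole tiles
    compute_time = padded * flops_per_position / peak_flops
    return max(mem_time, compute_time)


def near_free_positions(weight_bytes, flops_per_position, peak_flops, mem_bw,
                        tile=1, tolerance=0.05, max_n=4096):
    """Largest n whose toy latency stays within (1 + tolerance) of the
    single-position latency. `tolerance` is an assumed definition of near-free."""
    args = (weight_bytes, flops_per_position, peak_flops, mem_bw, tile)
    budget = (1.0 + tolerance) * layer_latency(1, *args)
    n = 1
    while n < max_n and layer_latency(n + 1, *args) <= budget:
        n += 1
    return n


# Hypothetical accelerator and layer: 100 TFLOP/s, 1 TB/s,
# 1e8 two-byte weights, 2 FLOPs per weight per position.
HW = dict(weight_bytes=2e8, flops_per_position=2e8, peak_flops=1e14, mem_bw=1e12)
print(idle_compute_estimate(**HW))
print(near_free_positions(tile=1, **HW))
print(near_free_positions(tile=16, **HW))
```

In this toy model the tile size alone moves the near-free boundary away from the roofline estimate. The abstract's point is that real kernels deviate far more, which is why a measured or granularity-aware budget is needed instead of the ridge point.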