🤖 AI Summary
Existing adaptations of speculative decoding to diffusion models struggle to sample the residual distribution efficiently in continuous space, which limits their performance. This work introduces a scheme that efficiently implements the original speculative sampling mechanism for diffusion models and, for the first time, brings block verification from large language models into the diffusion process, with verification performed in parallel. It also proposes Free Drafter, a training-free heuristic self-speculative drafter. The approach improves draft acceptance rates and achieves up to a 6.3% speedup over current speculative methods with negligible additional computational overhead.
📝 Abstract
Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions, which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
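For orientation, the discrete acceptance-rejection step that the abstract calls straightforward can be sketched as follows. This is a minimal illustration of standard single-token speculative sampling over a finite vocabulary, not the continuous-space scheme or the block verification proposed in this work. The function names and the toy distributions `p` (target) and `q` (draft) are hypothetical and chosen only for the example.

```python
import random


def residual_distribution(p, q):
    """Normalized positive part of (p - q); hypothetical helper name.

    In a discrete space this is a simple elementwise operation, which is
    exactly what becomes hard for continuous densities.
    """
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q coincide, so no rejection can occur; return p for safety.
        return list(p)
    return [d / total for d in diff]


def speculative_sample(p, q, rng):
    """Draw one token distributed as p using a draft drawn from q.

    Returns (token, accepted). Assumes p and q are probability lists over
    the same finite vocabulary.
    """
    n = len(p)
    draft = rng.choices(range(n), weights=q)[0]
    # Accept the draft with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    # On rejection, resample from the residual distribution.
    resid = residual_distribution(p, q)
    return rng.choices(range(n), weights=resid)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]  # toy target distribution (assumption)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (assumption)
    draws = [speculative_sample(p, q, rng) for _ in range(20000)]
    freqs = [sum(1 for t, _ in draws if t == i) / len(draws) for i in range(3)]
    accept_rate = sum(1 for _, a in draws if a) / len(draws)
    print("empirical token frequencies:", freqs)
    print("empirical acceptance rate:", accept_rate)
```

In this toy setting the empirical token frequencies should approach `p`, and the acceptance rate should approach the sum of `min(p_i, q_i)` over tokens, which is 0.7 for the distributions above. The continuous analogue requires sampling from a density proportional to the positive part of the difference between target and draft densities, which is the step the abstract identifies as non-trivial.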