Cost-Aware Diffusion Draft Trees for Speculative Decoding

📅 2026-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing speculative decoding methods: they maximize expected acceptance length while ignoring verification overhead, and so offer no principled way to choose the node budget. The authors propose CaDDTree, which optimizes token throughput (expected tokens generated per unit time) by explicitly modeling draft and verification latency and jointly selecting the tree structure and node budget. Their analysis shows that throughput is unimodal in the budget when the verification cost is convex, so the budget can be chosen online each round with a greedy stopping rule, with no offline search. CaDDTree combines the per-position marginals from a block diffusion drafter with this latency model and stopping rule. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction following show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
📝 Abstract
Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
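The greedy stopping rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes throughput takes the form expected accepted tokens divided by draft latency plus verification latency, and that candidate nodes arrive in order of non-increasing marginal gain in expected acceptance length. The names `greedy_budget`, `gains`, `draft_latency`, and `verify_latency` are hypothetical, and the numbers are made up for illustration.

```python
from typing import Callable, Sequence, Tuple


def greedy_budget(
    gains: Sequence[float],
    draft_latency: float,
    verify_latency: Callable[[int], float],
) -> Tuple[int, float]:
    """Pick a node budget by growing the tree until throughput stops improving.

    gains[i] is the assumed increase in expected accepted tokens from adding
    the (i+1)-th node, sorted in non-increasing order (an assumption of this
    sketch). verify_latency(n) is the assumed cost of verifying n nodes.
    """
    expected_tokens = 0.0
    best_budget, best_throughput = 0, 0.0
    for budget, gain in enumerate(gains, start=1):
        expected_tokens += gain
        throughput = expected_tokens / (draft_latency + verify_latency(budget))
        # If throughput is unimodal in the budget, the first non-improving
        # step means the peak has been passed, so stop here.
        if throughput <= best_throughput:
            break
        best_budget, best_throughput = budget, throughput
    return best_budget, best_throughput


def toy_verify_latency(n: int) -> float:
    # Hypothetical convex verification cost (arbitrary time units).
    return 2.0 + 0.05 * n * n


# Hypothetical per-node gains, largest first.
gains = [0.9, 0.7, 0.5, 0.3, 0.1, 0.05]
budget, throughput = greedy_budget(gains, 1.0, toy_verify_latency)
print(budget, round(throughput, 4))
```

With these toy inputs the loop stops after evaluating the fifth node, because adding it lowers throughput. The early exit is only safe when the throughput curve has a single peak, which is the property the abstract attributes to convex verification costs.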
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft tree
token throughput
verification cost
budget selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
cost-aware optimization
diffusion draft tree
token throughput
unimodal optimization