DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of existing block diffusion speculative decoding methods: every draft layer shares a single fused representation, which limits per-layer expressiveness and the scalability of the draft model. The authors propose a lightweight layer-wise fusion mechanism in which each draft layer attends to its own learnable combination of representations from multiple target model layers. The design improves expressiveness at negligible computational overhead, which allows the draft model to be scaled to greater depth, and the training set is enlarged to 2.4M samples to exploit the added capacity. Across six benchmarks, the method achieves additional speedups of approximately 11%, 8%, and 5% over DFlash on Qwen3-4B, Qwen3-8B, and GPT-OSS-20B, respectively, for end-to-end average speedups of 5.52×, 5.46×, and 3.91×.
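The contrast between a shared fused representation and layer-wise fusion can be illustrated with a minimal NumPy sketch. It assumes the "learnable combination" is a softmax-weighted sum over target-layer hidden states; the summary does not specify the exact form, so this is an illustration rather than the authors' implementation. The function names `shared_fusion` and `layerwise_fusion` and all tensor shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_fusion(target_states, logits):
    """Single fused representation reused by every draft layer.

    target_states: (num_target_layers, seq_len, hidden)
    logits:        (num_target_layers,) learnable mixing scores (assumed form)
    returns:       (seq_len, hidden)
    """
    weights = softmax(logits)
    return np.tensordot(weights, target_states, axes=1)

def layerwise_fusion(target_states, logits):
    """Each draft layer gets its own mixture of target layers.

    logits:  (num_draft_layers, num_target_layers), one row per draft layer
    returns: (num_draft_layers, seq_len, hidden)
    """
    weights = softmax(logits, axis=-1)
    return np.einsum("dt,tsh->dsh", weights, target_states)

# Hypothetical sizes: 8 target layers, 5 draft layers, 4 positions, width 16.
rng = np.random.default_rng(0)
target_states = rng.normal(size=(8, 4, 16))
layer_logits = rng.normal(size=(5, 8))

fused_shared = shared_fusion(target_states, layer_logits[0])
fused_per_layer = layerwise_fusion(target_states, layer_logits)
```

Under this assumed form, the per-layer variant adds only `num_draft_layers * num_target_layers` scalar parameters, which is consistent with the summary's description of the overhead as negligible.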
📝 Abstract
Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52× on Qwen3-4B, 5.46× on Qwen3-8B, and 3.91× on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
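The draft-then-verify step from the first sentence of the abstract can be sketched as follows. This assumes greedy verification, the common acceptance rule in speculative decoding under greedy sampling: keep the longest prefix of the drafted block that matches the target model's own predictions, then append the target's token at the first mismatch. The abstract does not state which acceptance rule is used, so the rule and the name `verify_block` are assumptions for illustration only.

```python
from typing import List, Sequence

def verify_block(draft_tokens: Sequence[int],
                 target_tokens: Sequence[int]) -> List[int]:
    """Greedy verification of one drafted block (assumed acceptance rule).

    draft_tokens:  the block proposed by the draft model in one shot.
    target_tokens: the target model's argmax prediction at each block
                   position, computed in a single parallel forward pass;
                   it has one more entry than draft_tokens (the prediction
                   that follows the full block).
    Returns the tokens committed to the output this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")

    accepted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)

    # The target's own prediction at the stopping point is always valid,
    # so every step commits at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted

# Hypothetical token ids: the first two drafted tokens match the target.
partial = verify_block([5, 7, 9, 2], [5, 7, 3, 8, 1])
full = verify_block([5, 7], [5, 7, 4])
```

In this sketch the number of tokens committed per target forward pass grows with the length of the accepted prefix, which is why a more capable draft model translates into wall-clock speedup.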
Problem

Research questions and friction points this paper is trying to address.

block diffusion speculative decoding
draft model capacity
layer-wise expressiveness
LLM inference acceleration
target model knowledge utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
block diffusion
layer-wise fusion
draft model scaling
LLM acceleration
🔎 Similar Papers
Jiebin Zhang
School of Computer Science, Peking University
Zhenghan Yu
School of Computer Science, Peking University
Song Liu
Tencent
Eugene J. Yu
School of Computer Science, Peking University
Zheng Li
Peking University
Artificial Intelligence, Natural Language Processing
Dawei Zhu
Peking University
long context modeling, agent, alignment
Jiangshan Duo
School of Computer Science, Peking University
Weimin Xiong
Peking University
Computer Science
Yifan Song
MOE Key Laboratory of Computational Linguistics, Peking University
Language Model, Agent
Guanghua Yu
Tencent
Jianchen Zhu
Tencent
Sujian Li
School of Computer Science, Peking University