🤖 AI Summary
Standard speculative decoding delivers limited inference speedup because draft generation and verification run strictly in sequence. This work proposes MineDraft, a batch-parallel speculative decoding framework with a dual-batch pipelined scheduling mechanism that overlaps draft generation for one batch of requests with verification of another. MineDraft also incorporates a cooperative verification protocol between the draft and target models to ensure output correctness while improving hardware utilization. Integrated into vLLM, the approach reduces end-to-end latency by up to 39% and improves throughput by up to 75% compared with standard speculative decoding.
📝 Abstract
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of the drafting and verification stages. To address this, this paper proposes MineDraft, a batch-parallel speculative decoding (PSD) framework designed to hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show that MineDraft significantly improves both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
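To make the overlap idea concrete, the following is a toy timing model, not the paper's theoretical analysis or the MineDraft implementation. It assumes fixed per-round drafting and verification times, two equally sized batches, and that drafting for one batch can run fully concurrently with verification for the other without contention. The function names and the example timings are hypothetical.

```python
def sequential_time(rounds: int, t_draft: float, t_verify: float) -> float:
    """Two batches, each running `rounds` draft-then-verify rounds back to back.

    Assumption: nothing overlaps, so every round costs t_draft + t_verify.
    """
    return 2 * rounds * (t_draft + t_verify)


def pipelined_time(rounds: int, t_draft: float, t_verify: float) -> float:
    """Two batches alternating so one drafts while the other verifies.

    Assumed schedule: batch A drafts alone first; then each verification
    overlaps with the other batch's next draft; finally batch B's last
    verification runs alone.
    """
    if rounds < 1:
        return 0.0
    # An overlapped slot lasts as long as the slower of the two stages.
    overlapped_slot = max(t_draft, t_verify)
    # 2 * rounds verifications in total, of which the last has nothing to overlap with.
    return t_draft + (2 * rounds - 1) * overlapped_slot + t_verify


if __name__ == "__main__":
    # Hypothetical timings: verification three times as slow as drafting.
    seq = sequential_time(rounds=10, t_draft=1.0, t_verify=3.0)
    par = pipelined_time(rounds=10, t_draft=1.0, t_verify=3.0)
    print(f"sequential: {seq}, pipelined: {par}, ratio: {seq / par:.2f}")
```

Under these assumptions, the drafting cost is almost entirely hidden whenever drafting is no slower than verification, and only the first draft and the last verification remain exposed. Real systems will deviate from this model because stage times depend on batch size, acceptance rates, and how the two models share hardware.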