Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

📅 2026-05-31

🤖 AI Summary

This work addresses the cost of autoregressive generation in large language models and the sensitivity of speculative decoding to how many drafted tokens are accepted. The authors propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses that payoff estimate to choose between cache verification and a model-based drafter. High-payoff cache drafts are thereby prioritized, which reduces sequential decoding work. The analysis also shows how prompt structure creates cache opportunities. Evaluated on three LLMs and sixteen datasets, the method is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73× average speedup.

📝 Abstract

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.
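
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram lookup used as the cache draft source, the threshold rule, and the names cache_draft, accepted_length, choose_draft_source, predict_accepted_len, and min_payoff are all assumptions made for illustration. The abstract does not specify how the accepted length is predicted, so the predictor is passed in as a plain callable.

```python
from typing import Callable, List, Sequence, Tuple


def cache_draft(context: Sequence[int], ngram: int = 2, max_len: int = 8) -> List[int]:
    """Parameter-free draft (assumed here to be an n-gram lookup): find the most
    recent earlier occurrence of the last `ngram` tokens and copy what followed it."""
    if len(context) <= ngram:
        return []
    suffix = list(context[-ngram:])
    # Scan right to left, skipping the trailing suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == suffix:
            follow = start + ngram
            return list(context[follow:follow + max_len])
    return []


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that agree with the target model's tokens
    (greedy verification: acceptance stops at the first mismatch)."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def choose_draft_source(
    context: Sequence[int],
    predict_accepted_len: Callable[[Sequence[int], Sequence[int]], float],
    min_payoff: float = 2.0,  # hypothetical threshold, not taken from the paper
) -> Tuple[str, List[int]]:
    """Verify the cache draft only when its predicted payoff is high enough;
    otherwise fall back to a model-based drafter (not implemented here)."""
    draft = cache_draft(context)
    if draft and predict_accepted_len(context, draft) >= min_payoff:
        return "cache", draft
    return "model", []


if __name__ == "__main__":
    ctx = [1, 2, 3, 4, 1, 2]
    # A constant stand-in predictor; a real one would be learned.
    print(choose_draft_source(ctx, lambda c, d: 3.0))
    print(accepted_length([3, 4, 1, 2], [3, 4, 9, 9]))
```

In this sketch, the predictor is consulted before any target-model call, which matches the abstract's point that a cache match can look promising yet have low payoff. A learned predictor would take the place of the constant stand-in used in the demo.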

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

large language model

cache verification

draft acceptance

generation efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding

Hybrid Verified Decoding

Cache Drafting