🤖 AI Summary
To address low token acceptance rates in speculative decoding caused by hidden-state mismatches between pretrained draft and target models, this paper proposes a lightweight dynamic steering vector mechanism. It computes a steering vector from the target model's real-time hidden states and injects it into the draft model's feed-forward layers—enabling online, training-free alignment. The method is plug-and-play, architecture-agnostic, and compatible with diverse pretrained language models. Experiments show up to a 35% improvement in accepted tokens under standard sampling and up to 22% under greedy sampling, with negligible computational overhead. The core contribution is introducing hidden-state-driven, real-time steering during the draft generation phase of speculative decoding—mitigating the distribution shift between draft and target models. This establishes a practical path to efficient LLM inference without model retraining or architectural modification.
📝 Abstract
Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which caps token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
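The abstract does not give the exact steering formula, but the idea—computing a vector from the verifier's hidden states and adding it inside the drafter's feed-forward block, with no weight updates—can be sketched in a toy form. Everything below (the `ffn` function, the scaled-difference steering rule, the scale `alpha`) is an illustrative assumption, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# A stand-in drafter feed-forward block (random weights for illustration).
W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

def ffn(h, steer=None):
    """Drafter FFN; optionally inject a steering vector into its input.
    Injection is purely additive at inference time: no weights change."""
    if steer is not None:
        h = h + steer
    return np.maximum(h @ W1, 0) @ W2  # ReLU MLP

# Hidden states for the current position (stand-in values).
h_verifier = rng.standard_normal(d)  # from the target/verifier model
h_drafter = rng.standard_normal(d)   # from the pretrained drafter

# Hypothetical steering rule: move the drafter's representation a
# fraction alpha of the way toward the verifier's representation.
alpha = 0.5
steer = alpha * (h_verifier - h_drafter)

out_plain = ffn(h_drafter)
out_steered = ffn(h_drafter, steer)

# The steered input is strictly closer to the verifier's hidden state.
gap_before = np.linalg.norm(h_drafter - h_verifier)
gap_after = np.linalg.norm((h_drafter + steer) - h_verifier)
assert gap_after < gap_before
```

Under this toy rule the gap shrinks by exactly `1 - alpha`; the appeal of the general approach is that such an injection is online and training-free, so it can be retrofitted to an existing drafter without retraining, as the abstract emphasizes.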