Utility-Driven Speculative Decoding for Mixture-of-Experts

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional speculative decoding in Mixture-of-Experts (MoE) models suffers from severe weight-loading spikes due to non-uniform expert activation: draft tokens collectively activate more experts, increasing verification overhead by 2–3× and, paradoxically, raising end-to-end latency by up to 1.5×. Moreover, the optimal speculation length *K* is highly dynamic, varying across tasks, requests, and decoding iterations, rendering static configuration ineffective. This work first identifies the "utility deficit" phenomenon in MoE speculative decoding, where speculative tokens yield diminishing or even negative utility under irregular expert routing. The authors propose Cascade, a utility-driven dynamic optimization framework: it introduces a lightweight, iteration-locality-aware utility metric to enable fine-grained, adaptive *K* selection and speculative acceptance gating, tightly integrated into vLLM. Evaluated on five mainstream MoE models, Cascade limits worst-case slowdown to 5% (versus up to 1.5× for static speculation), boosts throughput by 7–14% over a static *K*, and removes a key practicality bottleneck of speculative decoding for MoE models.
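The utility metric and gating rule described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names and the exact accounting of gains and costs are assumptions; only the gain/cost ratio and the utility-above-one rule come from the paper.

```python
# Hypothetical sketch of the "speculation utility" gate.
# Utility = (tokens gained per iteration) / (verification cost),
# both measured relative to plain one-token-per-step decoding.

def speculation_utility(tokens_accepted: int, verify_time: float,
                        baseline_step_time: float) -> float:
    """Ratio of token gains to verification cost.

    tokens_accepted:    tokens emitted this iteration (accepted drafts + 1).
    verify_time:        wall-clock time of the parallel verification pass.
    baseline_step_time: time a plain decode step would have taken.
    """
    token_gain = tokens_accepted             # tokens produced this step
    cost = verify_time / baseline_step_time  # steps' worth of time spent
    return token_gain / cost

def should_speculate(utility: float) -> bool:
    # Per the paper's rule: speculation pays off only when utility > 1.
    return utility > 1.0
```

With this accounting, a step that emits 3 tokens but whose verification costs two baseline steps has utility 1.5 and keeps speculation on; a step that emits 1 token at the same cost has utility 0.5 and disables it.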

📝 Abstract
GPU memory bandwidth is the main bottleneck for low-latency Large Language Model (LLM) inference. Speculative decoding leverages idle GPU compute by using a lightweight drafter to propose K tokens, which the LLM verifies in parallel, boosting token throughput. In conventional dense LLMs, all model weights are fetched each iteration, so speculation adds no latency overhead. Emerging Mixture of Experts (MoE) models activate only a subset of weights per token, greatly reducing data movement. However, we show that speculation is ineffective for MoEs: draft tokens collectively activate more weights, increasing data movement and verification time by 2-3x. When token throughput gains fail to offset this overhead, speculation causes slowdowns up to 1.5x, making it infeasible. Even when useful, the optimal K varies by task, model, and even between requests and iterations. Thus, despite widespread use in dense LLMs, speculation remains impractical in leading MoEs. We present Cascade, a utility-driven framework that selectively enables speculation to avoid slowdowns and dynamically tunes K to accelerate MoE serving. Cascade uses a lightweight metric, speculation utility, the ratio of token gains to verification cost, which shows iteration-level locality, enabling periodic decisions via short test and longer set phases. For each request, Cascade disables speculation if utility drops below one during testing, and when utility exceeds one, tests multiple K-values to choose the utility-maximizing K for the set phase. We implement Cascade in vLLM and evaluate it on five popular MoEs with workloads spanning code, math, extraction, and mixed tasks. Cascade limits slowdown to 5% (vs. 1.5x) and improves throughput by 7-14% over static K, making speculative decoding practical for MoEs.
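The test-and-set control loop from the abstract can be sketched per request as below. This is a hedged sketch under assumptions: the candidate *K* values and the `measure_utility` interface are hypothetical stand-ins; only the rule (pick the utility-maximizing *K*, or disable speculation when no *K* yields utility above one) comes from the abstract.

```python
# Hypothetical sketch of Cascade's periodic test phase for one request.
# During a short test phase, each candidate K is tried and its utility
# measured; the utility-maximizing K (or no speculation, if every
# utility is <= 1) is then locked in for a longer set phase.

from typing import Callable, Optional

def choose_k(candidate_ks: list[int],
             measure_utility: Callable[[int], float]) -> Optional[int]:
    """Return the best K for the next set phase, or None to disable
    speculation when no candidate K yields utility above 1."""
    best_k: Optional[int] = None
    best_utility = 1.0  # a K must beat utility 1.0 to be worthwhile
    for k in candidate_ks:
        u = measure_utility(k)  # run a few iterations at this K
        if u > best_utility:
            best_k, best_utility = k, u
    return best_k
```

Here `measure_utility` stands in for running a short burst of iterations at speculation length `K` and computing the token-gain to verification-cost ratio; the iteration-level locality of that metric is what justifies reusing the chosen `K` across the longer set phase.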
Problem

Research questions and friction points this paper is trying to address.

Optimize speculative decoding for Mixture-of-Experts (MoE) models
Reduce GPU memory bandwidth bottleneck in MoE inference
Use a speculation-utility signal to selectively enable speculation and avoid slowdowns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utility-driven framework for selective speculation
Dynamic tuning of K to maximize utility
Lightweight metric for iteration-level decisions