🤖 AI Summary
This work addresses the high inference cost of large language models, which stems from their depth and parameter count. Existing depth pruning methods offer limited control over user-specified compute budgets and typically fix the routing path, so they cannot adapt as the context grows during decoding. To overcome these limitations, we propose Buddy, a budget-driven dynamic depth routing framework in which a lightweight, input-conditioned Decision Module scores intermediate layers and deterministically executes the top-k of them to satisfy a given budget. For decode-time rerouting, Buddy reuses the first-layer KV cache as a low-overhead global context and pools it with the newest token representation before each routing decision. On Llama-family and Qwen models, Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
📝 Abstract
Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
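The routing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `mean_pool`, `select_layers`, `route_step`, and `score_fn` are hypothetical, mean pooling is an assumed choice (the abstract does not specify the pooling operator), and plain lists of floats stand in for first-layer KV cache entries and token representations.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors elementwise (assumed pooling operator)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_layers(scores: Sequence[float], budget_k: int) -> List[int]:
    """Pick the budget_k highest-scoring layers, returned in execution order.

    Ties are broken by the lower layer index, so the choice is deterministic.
    """
    if not 0 <= budget_k <= len(scores):
        raise ValueError("budget_k must be between 0 and the number of layers")
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget_k])


def route_step(
    cached_context: Sequence[Vector],
    newest_token: Vector,
    score_fn: Callable[[Vector], Sequence[float]],
    budget_k: int,
) -> List[int]:
    """One routing decision: pool context with the newest token, score, select.

    cached_context stands in for vectors derived from the first-layer KV cache;
    score_fn stands in for the Decision Module (both are assumptions here).
    """
    pooled = mean_pool(list(cached_context) + [newest_token])
    return select_layers(score_fn(pooled), budget_k)


def make_linear_scorer(weights: Sequence[Vector]) -> Callable[[Vector], List[float]]:
    """Hypothetical stand-in scorer: one dot product per intermediate layer."""
    def score(context: Vector) -> List[float]:
        return [sum(w * x for w, x in zip(row, context)) for row in weights]
    return score


if __name__ == "__main__":
    scorer = make_linear_scorer([[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0], [1.0, 1.0]])
    context = [[1.0, 0.0], [0.0, 1.0]]
    # With a budget of 2 out of 4 intermediate layers, exactly 2 are executed.
    print(route_step(context, [2.0, 2.0], scorer, budget_k=2))
```

Because the selection always returns exactly `budget_k` indices, the budget holds by construction rather than in expectation, and calling `route_step` again at each decoding step with the updated context lets the chosen path change as generation proceeds.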