🤖 AI Summary
To address the inefficiency of attention computation and the inflexibility of fixed sparsity budgets in long-context large language models, this paper proposes a hierarchical Top-$p$ adaptive attention sparsification framework. It is the first to adapt the Top-$p$ sampling principle to attention sparsification, enabling dynamic, training-free adjustment of sparsity budgets. By integrating hierarchical importance estimation with adaptive key-value (KV) cache compression, the framework seamlessly augments existing sparse attention methods without compromising accuracy. Experiments demonstrate that our approach prunes up to 98% of redundant tokens, accelerates self-attention computation by 15.4×, and reduces end-to-end per-token latency by 3.9×. These gains significantly improve the efficiency–accuracy trade-off for long-context inference.
📝 Abstract
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge in deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) for sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, yielding a $15.4\times$ acceleration in self-attention operations and a $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
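To make the core idea concrete, here is a minimal sketch (not the paper's actual kernel) of how top-$p$ selection from nucleus sampling transfers to attention: instead of keeping a fixed number of keys, keep the smallest set of keys whose softmax attention weights sum to at least $p$. The function name and the NumPy implementation are illustrative assumptions, not Twilight's API.

```python
import numpy as np

def top_p_attention_mask(scores, p=0.95):
    """Keep the smallest set of keys whose softmax attention
    weights cover at least probability mass p (nucleus-style
    selection applied to attention, as a sketch)."""
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    order = np.argsort(weights)[::-1]         # keys by descending weight
    csum = np.cumsum(weights[order])
    k = np.searchsorted(csum, p) + 1          # minimal prefix reaching mass p
    mask = np.zeros_like(scores, dtype=bool)
    mask[order[:k]] = True
    return mask

# The budget adapts to the score distribution: a peaked distribution
# keeps very few keys, while a flat one keeps nearly all of them.
peaked = np.array([8.0, 1.0, 0.5, 0.2, 0.1])
flat = np.zeros(5)
print(top_p_attention_mask(peaked, p=0.9).sum())  # 1 key retained
print(top_p_attention_mask(flat, p=0.9).sum())    # all 5 keys retained
```

This adaptivity is the point of the abstract's argument: a fixed top-$k$ budget would retain the same number of keys for both distributions, wasting computation on the peaked case or losing accuracy on the flat one.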