🤖 AI Summary
To address the inefficiency of attention computation and the inflexibility of fixed sparsity budgets in long-context large language models, this paper proposes a hierarchical Top-$p$ adaptive attention sparsification framework. It is the first to adapt the Top-$p$ sampling principle to attention sparsification, enabling dynamic, training-free adjustment of sparsity budgets. By integrating hierarchical importance estimation with adaptive key-value (KV) cache compression, the framework seamlessly augments existing sparse attention methods without compromising accuracy. Experiments demonstrate that our approach prunes up to 98% of redundant tokens, accelerates self-attention computation by 15.4×, and reduces end-to-end per-token latency by 3.9×. These gains significantly improve the efficiency–accuracy trade-off for long-context inference.
📝 Abstract
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge in deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) for sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, yielding a $15.4\times$ acceleration in self-attention operations and a $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
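To make the core idea concrete, here is a minimal sketch (not the paper's actual kernel) of how top-$p$ selection from nucleus sampling transfers to attention: instead of keeping a fixed number of keys, keep the smallest set of keys whose softmax attention weights sum to at least $p$. The function name and the NumPy implementation are illustrative assumptions, not Twilight's API.

```python
import numpy as np

def top_p_attention_mask(scores, p=0.95):
    """Keep the smallest set of keys whose softmax attention
    weights cover at least probability mass p (nucleus-style
    selection applied to attention, as a sketch)."""
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    order = np.argsort(weights)[::-1]         # keys by descending weight
    csum = np.cumsum(weights[order])
    k = np.searchsorted(csum, p) + 1          # minimal prefix reaching mass p
    mask = np.zeros_like(scores, dtype=bool)
    mask[order[:k]] = True
    return mask

# The budget adapts to the score distribution: a peaked distribution
# keeps very few keys, while a flat one keeps nearly all of them.
peaked = np.array([8.0, 1.0, 0.5, 0.2, 0.1])
flat = np.zeros(5)
print(top_p_attention_mask(peaked, p=0.9).sum())  # 1 key retained
print(top_p_attention_mask(flat, p=0.9).sum())    # all 5 keys retained
```

This adaptivity is the point of the abstract's argument: a fixed top-$k$ budget would retain the same number of keys for both distributions, wasting computation on the peaked case or losing accuracy on the flat one.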