From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

📅 2026-06-08

🤖 AI Summary

Existing approaches to long-context large language model inference often rely on fixed sparsity patterns or uniform computational budgets, overlooking how attention behavior varies among heads and across context positions. This work proposes EntropyInfer, a training-free framework that uses attention entropy during the prefill phase to partition attention heads into rigid and dynamic categories, enabling context-aware allocation of computational resources. During decoding, it introduces a KV cache compression mechanism, likewise training-free, that retains the cache entries most relevant to the generated content. Evaluated on Llama, Qwen, and openPangu models, it delivers up to 2.39× end-to-end speedup on sequences exceeding 100k tokens with minimal quality degradation, consistently outperforming baselines such as SnapKV and AdaKV.
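
To make the rigid/dynamic split concrete, here is a minimal sketch of entropy-based head classification. It is not the authors' implementation: the function names, the threshold `rigid_eps`, and the rule "rigid if entropy stays below the threshold in every segment" are assumptions chosen only to illustrate the idea that rigid heads have near-zero entropy across input segments.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def classify_heads(head_segment_attn, rigid_eps=0.05):
    """Label each head 'rigid' or 'dynamic'.

    head_segment_attn maps a head name to a list of attention
    distributions, one per input segment. rigid_eps is a hypothetical
    threshold, not a value taken from the paper.
    """
    labels = {}
    for head, segments in head_segment_attn.items():
        entropies = [attention_entropy(w) for w in segments]
        # Assumed rule: a head is rigid only if its entropy is near zero
        # in every segment; otherwise it is treated as dynamic.
        labels[head] = "rigid" if max(entropies) < rigid_eps else "dynamic"
    return labels

# Toy input: head_0 always attends to a single position, head_1 does not.
example = {
    "head_0": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    "head_1": [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]],
}
labels = classify_heads(example)
```

Because the labels are computed from the attention weights of the current input, the same head may come out rigid for one context and dynamic for another, which is why the split cannot be fixed offline.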

📝 Abstract

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
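
The decoding-side idea, scoring cache entries with attention from generated tokens and keeping only the highest-scoring ones, can be sketched as follows. This is an illustration under stated assumptions, not the paper's scheme: `select_kv_entries`, the summed-attention score, and the fixed `budget` are hypothetical, and the sketch does not model whatever the "latent" part of the authors' method involves.

```python
def select_kv_entries(decode_attn, budget):
    """Return the sorted cache positions to keep.

    decode_attn: list of attention rows, one per generated token, each
    row giving that token's attention weights over the cached positions.
    budget: number of cache entries to retain (hypothetical parameter).
    """
    if not decode_attn:
        raise ValueError("need attention from at least one generated token")
    n = len(decode_attn[0])
    # Assumed importance score: total attention a cached position
    # receives from the generated tokens observed so far.
    scores = [0.0] * n
    for row in decode_attn:
        for i, w in enumerate(row):
            scores[i] += w
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Keep the top entries, returned in original cache order.
    return sorted(ranked[:budget])

# Two generated tokens attending over four cached positions.
decode_attn = [
    [0.5, 0.1, 0.1, 0.3],
    [0.1, 0.1, 0.2, 0.6],
]
kept = select_kv_entries(decode_attn, budget=2)
```

In this toy case positions 0 and 3 receive the most attention from the generated tokens, so they are the two entries retained.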

Problem

Research questions and friction points this paper is trying to address.

long-context LLMs

sparse attention

KV cache compression

attention entropy

adaptive inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-guided inference

adaptive attention

KV cache compression