PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

📅 2024-06-04
🏛️ arXiv.org
📈 Citations: 83
Influential: 16
🤖 AI Summary
To address the excessive KV cache memory overhead in long-context inference with large language models (LLMs), this work identifies a “Pyramidal Information Funneling” pattern in attention: lower layers attend broadly to local details, while higher layers progressively consolidate onto a small set of critical tokens. Leveraging this insight, the authors propose PyramidKV, a layer-wise adaptive KV cache compression method that abandons uniform per-layer allocation, instead integrating attention-flow analysis, dynamic scheduling, and lightweight token-importance estimation. Experiments demonstrate state-of-the-art efficiency–accuracy trade-offs: on LongBench, retaining only 12% of the full KV cache preserves baseline performance; on TREC, with merely 0.7% of the cache, PyramidKV surpasses competing compression methods by up to 20.5 absolute accuracy points; and in the needle-in-a-haystack evaluation, LLaMA-3-70B achieves 100.0% accuracy with just 128 KV cache entries. This work establishes a novel paradigm for memory-efficient long-context inference in LLMs.

📝 Abstract
In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on the TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLaMA-3-70B model to achieve 100.0% accuracy.
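The "focusing on critical tokens" behavior described above suggests a simple per-layer eviction rule: within a layer's cache budget, keep the KV entries that accumulate the most attention. The sketch below is an illustrative reduction of that idea, not the paper's exact scoring; `select_kv_entries` and its inputs are hypothetical names.

```python
def select_kv_entries(scores: list[float], budget: int) -> list[int]:
    """Pick which KV cache positions to retain in one layer.

    scores: accumulated attention mass each cached token has received
            (how the mass is accumulated is an assumption here).
    budget: number of KV entries this layer is allowed to keep.

    Returns the indices of the `budget` highest-scoring positions,
    in ascending order so the cache's positional order is preserved.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: five cached tokens, keep the two most-attended ones.
kept = select_kv_entries([0.1, 0.9, 0.3, 0.7, 0.05], budget=2)
print(kept)  # → [1, 3]
```

In a real decoder this selection would run once per layer after prefill, with lower layers given larger budgets than higher ones, matching the pyramidal allocation the abstract describes.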
Problem

Research questions and friction points this paper is trying to address.

Investigates attention patterns in LLMs for long context processing
Develops dynamic KV cache compression to reduce memory usage
Improves accuracy in memory-efficient long-context comprehension scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic KV cache compression based on pyramidal funneling
Layer-specific KV cache allocation for efficiency
Retains performance with only 12% KV cache
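The layer-specific allocation above can be sketched as a decreasing schedule that gives the lowest layer several times the budget of the highest. The linear (arithmetic) decay and the `ratio` parameter below are assumptions for illustration, not the paper's exact allocation scheme.

```python
def pyramid_budgets(total_budget: int, num_layers: int, ratio: float = 4.0) -> list[int]:
    """Split a total KV cache budget across layers, pyramid-style.

    The highest layer receives a base allocation, the lowest layer
    receives `ratio` times that base, and intermediate layers are
    linearly interpolated; the per-layer budgets sum to total_budget.
    """
    top = 2 * total_budget / (num_layers * (1 + ratio))   # entries for the highest layer
    bottom = ratio * top                                   # entries for the lowest layer
    step = (bottom - top) / max(num_layers - 1, 1)
    budgets = [round(bottom - i * step) for i in range(num_layers)]
    # Absorb rounding drift so the budgets sum exactly to total_budget.
    budgets[0] += total_budget - sum(budgets)
    return budgets


# Toy example: a 32-layer model with 2048 total KV entries.
b = pyramid_budgets(2048, 32)
print(b[0], b[-1], sum(b))  # → 102 26 2048
```

With ratio 4, the lowest layer keeps roughly four times as many entries as the highest, reflecting the observation that lower layers spread attention over many tokens while higher layers need only a few critical ones.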
Authors
Zefan Cai (Peking University)
Yichi Zhang (Peking University)
Bofei Gao (Peking University)
Yuliang Liu (Nanjing University)
Tianyu Liu (Qwen)
Keming Lu (Qwen)
Wayne Xiong (Microsoft)
Yue Dong (University of California, Riverside)
Baobao Chang (Peking University)
Junjie Hu (University of Wisconsin–Madison)
Wen Xiao (Microsoft)