Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

📅 2025-02-06
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work targets the high memory and computational overhead of long-sequence inference in large language models (LLMs) caused by redundancy in key-value (KV) caches. The authors propose a KV cache criticality quantification method grounded in output perturbation modeling, formally defining the criticality of each KV entry as the worst-case upper bound on the output perturbation induced by its removal. Unlike heuristic pruning based on attention weights alone, this analysis reveals the dominant influence of value states and pretrained parameter matrices on perturbation magnitude, and yields an optimal selection algorithm under perturbation constraints. The method is evaluated on Llama-family models and achieves state-of-the-art performance on the Needle-in-a-Haystack and LongBench benchmarks, attaining lower output perturbation than existing cache eviction methods in over 92% of attention heads and significantly improving the trade-off between inference efficiency and accuracy.
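The summary's core idea can be illustrated with a minimal sketch. The paper's actual bound also involves the pretrained projection matrices; the score below, attention weight times value-state norm, is a simplified stand-in chosen here to show how value states (not just attention weights) enter the criticality of a KV entry. The function name and score formula are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def criticality_scores(attn_weights, values):
    """Score each cached KV entry by a (simplified) upper bound on the
    output perturbation its removal could cause.

    Illustrative assumption: we use a_i * ||v_i||_2 per entry, reflecting
    the summary's point that both the attention weight a_i and the value
    state v_i matter. The paper's full bound additionally involves the
    pretrained parameter matrices, omitted here.

    attn_weights: (seq_len,) softmax attention weights for one head/query
    values:       (seq_len, head_dim) cached value states
    returns:      (seq_len,) per-entry criticality scores
    """
    value_norms = np.linalg.norm(values, axis=-1)  # ||v_i||_2 per entry
    return attn_weights * value_norms
```

Note that two entries with equal attention weight can have very different scores when their value norms differ, which is exactly the case attention-weight-only heuristics miss.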

📝 Abstract
Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and the LongBench benchmark show our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbations in over 92% of attention heads in the Llama model, thereby providing a significant improvement over existing methods.
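The "perturbation-constrained selection" the abstract describes can be sketched as a greedy eviction loop: evict the entries with the smallest perturbation bounds first, stopping when the accumulated bound would exceed a budget. This is a hypothetical reading under the assumption that per-entry bounds compose additively (a triangle-inequality-style argument); the paper's optimal algorithm may differ in detail.

```python
import numpy as np

def select_kv_entries(scores, budget):
    """Greedy perturbation-constrained KV cache eviction (illustrative).

    scores: per-entry worst-case output-perturbation bounds
    budget: maximum total perturbation allowed from evicted entries
            (assumption: bounds compose additively)

    Evicts the lowest-bound entries while the running total stays within
    the budget; everything else is kept as critical.
    """
    order = np.argsort(scores)          # ascending: cheapest to evict first
    evicted, total = [], 0.0
    for i in order:
        if total + scores[i] > budget:  # next eviction would break the budget
            break
        total += scores[i]
        evicted.append(int(i))
    kept = [i for i in range(len(scores)) if i not in set(evicted)]
    return kept, evicted
```

With a tight budget the loop keeps nearly everything; loosening the budget trades output fidelity for a smaller cache, which is the efficiency/accuracy trade-off the abstract refers to.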
Problem

Research questions and friction points this paper is trying to address.

Identify critical KV cache entries
Reduce storage and runtime costs
Optimize worst-case output perturbation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perturbation-constrained selection algorithm
Key-Value cache optimization
Attention output perturbation analysis
Yuan Feng
School of Computer Science, University of Science and Technology of China (USTC), China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
Junlin Lv
USTC
Yukun Cao
School of Computer Science, University of Science and Technology of China (USTC), China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
Xike Xie
School of Biomedical Engineering, USTC, China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
S. Kevin Zhou
School of Biomedical Engineering, USTC, China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China