🤖 AI Summary
In long-context LLM inference, KV cache read latency grows significantly with context length, while existing page-level retrieval methods suffer from low precision due to the sparse distribution of critical tokens. To address this, we propose a fine-grained KV cache retrieval method: (1) introducing 1-bit quantized keys for token-level importance estimation; (2) integrating dynamic query-relevance matching; and (3) employing importance-score-driven cache management. This approach moves beyond the coarse granularity of page-level methods, enabling precise identification and retention of sparse, high-value tokens. Experiments demonstrate that our method matches full KV cache performance using only 11% of the original cache budget, reduces decoding latency by 1.2–1.5×, and substantially improves inference efficiency for long-context workloads.
📝 Abstract
The Key-Value (KV) cache reading latency increases significantly with context length, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of the KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a **Fi**ne-Grained and **E**fficient KV cache **R**etrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments show that Fier matches full KV performance using only 11% of the cache budget across various long-context tasks, reducing decoding latency by 1.2× to 1.5×.
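The core idea above — scoring each cached token with 1-bit quantized keys and keeping only the top fraction — can be sketched as follows. This is a minimal illustration, not Fier's actual implementation: the sign-based quantizer, the per-token scale, and the fixed 11% budget are assumptions for demonstration.

```python
import numpy as np

def quantize_keys_1bit(keys):
    """Sign-based 1-bit quantization of cached keys.

    Keeps only the sign of each element plus a per-token scale so that
    approximate dot products stay roughly calibrated. (Hypothetical
    quantizer; Fier's exact scheme is not specified here.)
    """
    signs = np.where(keys >= 0, 1.0, -1.0)
    scales = np.abs(keys).mean(axis=-1, keepdims=True)
    return signs, scales

def estimate_token_importance(query, signs, scales):
    """Approximate attention logits with the quantized keys, then softmax."""
    approx_logits = (signs * scales) @ query          # shape: (num_tokens,)
    exp = np.exp(approx_logits - approx_logits.max())  # numerically stable
    return exp / exp.sum()

def select_tokens(importance, budget_ratio=0.11):
    """Token-level (not page-level) selection: keep the top-scoring fraction."""
    k = max(1, int(len(importance) * budget_ratio))
    return np.argsort(importance)[-k:]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64))   # 1000 cached tokens, head dim 64
query = rng.standard_normal(64)          # current decoding query

signs, scales = quantize_keys_1bit(keys)
importance = estimate_token_importance(query, signs, scales)
kept = select_tokens(importance, budget_ratio=0.11)
print(len(kept))  # 110 tokens retained under an 11% budget
```

Because selection operates on individual tokens, sparsely scattered high-value tokens are retained even when no single page is uniformly important — the failure mode the abstract attributes to page-level retrieval.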