🤖 AI Summary
This work addresses the memory and bandwidth bottlenecks caused by unbounded KV cache growth in multi-turn agent reasoning. The authors propose IntentKV, a learned KV cache pruning method that keeps the base large language model frozen and preserves critical history through a session-level QueryMemory that models cross-turn intent. The approach combines memory-attention token scoring, a zero-initialized residual head with cross-attention over current-query K-vectors, and a slot-map redirection strategy that leaves the K/V rows, RoPE phases, and slot identities of surviving tokens in place. At an 8k KV budget, the method reduces mean peak request tokens by 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries on Qwen2.5-14B, it cuts worst-case raw KV reads by 92.6%, with almost no accuracy drop relative to the full-cache baseline.
📝 Abstract
Multi-turn LLM agents fan short queries out into long trajectories of tool calls, search results, and intermediate reasoning. Both key-value (KV) memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the KV cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, a learned KV pruning method that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.
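The slot-map redirection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `SlotMap`, `DEAD_SLOT`, and `select_evictions` are hypothetical, the per-token scores are hard-coded stand-ins for whatever the learned memory-attention scorer would produce, and the "drop the lowest-scored tokens until the budget is met" rule is an assumed policy. The point it shows is that eviction only rewrites entries in a position-to-slot table, so surviving tokens keep their original physical slots and nothing is compacted or re-encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sentinel: physical slot 0 is reserved as the "dead" slot.
DEAD_SLOT = 0


@dataclass
class SlotMap:
    """Maps each logical token position to a physical KV slot (illustrative only)."""
    slots: List[int] = field(default_factory=list)

    def append(self, physical_slot: int) -> None:
        # Register a new token position backed by the given physical slot.
        self.slots.append(physical_slot)

    def evict(self, positions: List[int]) -> None:
        # Eviction is a redirection: the entry now points at the dead slot.
        # Entries for surviving positions are not touched.
        for pos in positions:
            self.slots[pos] = DEAD_SLOT

    def live_positions(self) -> List[int]:
        return [pos for pos, slot in enumerate(self.slots) if slot != DEAD_SLOT]


def select_evictions(scores: Dict[int, float], budget: int) -> List[int]:
    """Pick positions to drop so that at most `budget` tokens stay live.

    Assumed policy: drop the lowest-scored positions first.
    """
    excess = len(scores) - budget
    if excess <= 0:
        return []
    ranked = sorted(scores, key=lambda pos: scores[pos])  # ascending by score
    return sorted(ranked[:excess])


# Six history tokens stored in physical slots 1..6.
slot_map = SlotMap()
for physical in range(1, 7):
    slot_map.append(physical)

# Stand-in importance scores for the live positions (made up for the example).
scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.05, 4: 0.7, 5: 0.3}

dropped = select_evictions(scores, budget=4)
slot_map.evict(dropped)

print(dropped)                    # positions redirected to the dead slot
print(slot_map.slots)             # survivors keep their original slots
print(slot_map.live_positions())
```

Because surviving entries are never moved, a prefix cache that indexes the same physical slots would still see them where it expects, which is the composability property the abstract attributes to this style of eviction.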