IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory and bandwidth bottlenecks caused by KV cache growth across multi-turn agent trajectories. The authors propose IntentKV, a learned KV cache pruning method that leaves the base large language model frozen and uses a session-level QueryMemory to model cross-turn intent. The approach combines a memory-attention rule for scoring live history tokens, a zero-initialized residual head with cross-attention over current-query K-vectors, and slot-map redirection for eviction, which routes dropped positions to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. At an 8k KV budget, the method reduces mean peak request tokens by 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, it cuts worst-case raw KV reads by 92.6%, with almost no accuracy drop relative to the no-pruning full-cache baseline.

📝 Abstract

Multi-turn LLM agents fan short queries out into long trajectories of tool calls, search results, and intermediate reasoning. Both key-value (KV) memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the KV cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, a learned KV pruning method that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.
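The slot-map redirection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SlotMap` class, the `DEAD_SLOT` constant, the `prune_to_budget` helper, and the top-k-by-score selection are all hypothetical names and choices made for illustration, and the scores are made-up numbers standing in for whatever the memory-attention rule would produce. The point it shows is that eviction only rewrites map entries, so surviving tokens keep both their logical position (and hence their RoPE phase) and their physical slot.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sentinel: physical slot 0 is reserved and never holds live K/V.
DEAD_SLOT = 0


@dataclass
class SlotMap:
    """Maps logical token positions to physical KV slots (illustrative only)."""
    slots: List[int] = field(default_factory=list)

    def append(self, physical_slot: int) -> int:
        """Register a new token and return its logical position."""
        if physical_slot == DEAD_SLOT:
            raise ValueError("slot 0 is reserved as the dead slot")
        self.slots.append(physical_slot)
        return len(self.slots) - 1

    def evict(self, positions: List[int]) -> None:
        """Redirect dropped positions to the dead slot; no other entry moves."""
        for pos in positions:
            self.slots[pos] = DEAD_SLOT

    def live_positions(self) -> List[int]:
        """Logical positions that still point at a real K/V row."""
        return [p for p, s in enumerate(self.slots) if s != DEAD_SLOT]


def prune_to_budget(slot_map: SlotMap, scores: List[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring live positions and evict the rest.

    Top-k selection is an assumption of this sketch, not the paper's rule.
    """
    live = slot_map.live_positions()
    keep = set(sorted(live, key=lambda p: scores[p], reverse=True)[:budget])
    dropped = [p for p in live if p not in keep]
    slot_map.evict(dropped)
    return dropped


# Five tokens stored in physical slots 1..5, with made-up importance scores.
sm = SlotMap()
for phys in (1, 2, 3, 4, 5):
    sm.append(phys)
scores = [0.9, 0.1, 0.5, 0.05, 0.7]

dropped = prune_to_budget(sm, scores, budget=3)
print(dropped)   # [1, 3]
print(sm.slots)  # [1, 0, 3, 0, 5]
```

Because no surviving row is copied or renumbered in this scheme, a prefix cache that indexes the same physical slots would remain valid after pruning, which is the composability property the abstract points to.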

Problem

Research questions and friction points this paper is trying to address.

KV cache

multi-turn agents

inference bottleneck

memory efficiency

long-horizon reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache pruning

intent-aware

cross-turn memory