🤖 AI Summary
Existing KV cache mechanisms support only append-only growth or kernel-managed eviction. They cannot accept agent-driven, policy-guided edits to cached context, so every edit to the prefix forces a full re-prefill of everything after it. This work proposes Leyline, a serving-side primitive that separates what to edit from how position correctness is preserved, using a declarative directive 4-tuple; the authors state that no existing primitive offers externally directed KV cache editing. Leyline's architecture-agnostic interface routes each directive to a per-architecture kernel that applies a closed-form RoPE-rotation correction, and it supports two edit modes: in-place splice and prefix-trimmed re-prefill. Experiments show that the splice kernel raises replay cache-hit rate by 11.2 percentage points and cuts latency by up to 241 ms; on debug-gym, a ten-line truncation rule routed through the same interface raises agentic solve rate by 14.3 percentage points.
📝 Abstract
Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper's focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.
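The closed-form RoPE-rotation correction mentioned above rests on a standard property of rotary embeddings: rotations compose, so a key that was rotated for position p can be moved to position q by one further rotation of q - p, without recomputing it from the hidden state. The following is a minimal sketch of that property only, not Leyline's kernel. The function names, the interleaved pairing convention, the base of 10000, and the example positions are all assumptions made for illustration.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by the RoPE angles for position `pos`.

    Assumes the interleaved convention: dims (0,1), (2,3), ... form the pairs.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # pair k = i/2 uses base^(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def shift_cached_key(cached_key, delta, base=10000.0):
    """Hypothetical helper: re-position an already-rotated key by `delta` tokens."""
    return rope_rotate(cached_key, delta, base)


# A key cached at position 120 must move to position 87 after an upstream
# span of 33 tokens is removed (illustrative numbers).
raw_key = [0.3, -1.2, 0.7, 0.05, -0.4, 0.9, 1.1, -0.6]
old_pos, new_pos = 120, 87

cached = rope_rotate(raw_key, old_pos)
corrected = shift_cached_key(cached, new_pos - old_pos)
recomputed = rope_rotate(raw_key, new_pos)

# The corrected key should match a from-scratch recomputation up to float error.
max_err = max(abs(a - b) for a, b in zip(corrected, recomputed))
```

Only keys carry a position-dependent rotation in standard RoPE attention, so under these assumptions the cached values would not need any correction. Architectures that apply the rotation differently (for example, to a separate positional slice of the key) would need their own variant, which is consistent with the abstract's per-architecture kernels.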