The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high memory overhead and engineering complexity of standard Transformer inference, which relies on large key-value (KV) caches. The authors establish that the residual stream is the sole information-carrying state in Transformers and satisfies a Markov property, enabling lossless reconstruction of all KV pairs. Building on this insight, they propose KV-Direct, an inference framework that eliminates KV caching by recomputing keys and values on demand from checkpointed residual vectors. KV-Direct achieves token-by-token output equivalence with standard inference across six diverse models, reduces per-token cache size from 136 KB to 5 KB, cuts peak memory usage by over 59% in 20-turn dialogues, and makes recomputation up to 5× faster than reading cached tensors at moderate batch sizes.
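The summary's central claim, that keys and values are deterministic projections of the residual stream, can be illustrated with a minimal sketch. This is not the authors' code (see the repository linked in the abstract); the pre-LayerNorm placement and the separate W_K / W_V projection matrices are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' implementation) of the redundancy claim:
# per-layer keys and values are deterministic projections of the residual
# stream, so caching one residual vector per token is enough to rebuild them
# exactly. Pre-LayerNorm placement and separate W_K / W_V are assumptions.
import torch

torch.manual_seed(0)
d_model, d_head = 64, 16
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
ln = torch.nn.LayerNorm(d_model)

resid = torch.randn(1, d_model)  # residual-stream vector for one token

# Standard path: project once and store the K/V tensors in the cache.
k_cached, v_cached = ln(resid) @ W_K, ln(resid) @ W_V

# Cache-free path: keep only `resid` and recompute the projections on demand.
k_recomputed, v_recomputed = ln(resid) @ W_K, ln(resid) @ W_V

# Identical inputs through identical deterministic ops give bit-identical outputs.
assert torch.equal(k_cached, k_recomputed)
assert torch.equal(v_cached, v_recomputed)
```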

📝 Abstract
The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.
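Below is a hedged sketch of the caching strategy the abstract describes: store one residual vector per token and recompute a layer's keys and values on demand. The class and method names are illustrative, not the KV-Direct repository's API, and the sketch covers a single layer, assuming the checkpointed residual is that layer's input; the paper's checkpointing mechanism generalizes this across layers.

```python
# Illustrative sketch of residual checkpointing, assuming a single layer and
# pre-LayerNorm structure; not the API of the KV-Direct repository.
import torch

class ResidualCheckpointCache:
    """Stores one residual-stream vector per token instead of per-layer K/V."""

    def __init__(self) -> None:
        self.residuals: list[torch.Tensor] = []

    def append(self, resid: torch.Tensor) -> None:
        # `resid`: (d_model,) residual vector checkpointed for the newest token.
        self.residuals.append(resid)

    def keys_values(self, ln: torch.nn.LayerNorm,
                    W_K: torch.Tensor, W_V: torch.Tensor):
        # Recompute keys and values for all cached tokens on demand. The
        # projections are deterministic, so the result matches what a
        # conventional KV cache would have stored.
        x = ln(torch.stack(self.residuals))   # (seq_len, d_model)
        return x @ W_K, x @ W_V               # (seq_len, d_head) each


# Usage: per token we keep d_model floats rather than
# 2 * n_layers * n_kv_heads * d_head floats of keys and values.
d_model, d_head = 64, 16
ln = torch.nn.LayerNorm(d_model)
W_K, W_V = torch.randn(d_model, d_head), torch.randn(d_model, d_head)

cache = ResidualCheckpointCache()
for _ in range(4):                 # pretend four tokens have been decoded
    cache.append(torch.randn(d_model))

K, V = cache.keys_values(ln, W_K, W_V)
print(K.shape, V.shape)            # torch.Size([4, 16]) torch.Size([4, 16])
```

The per-token saving the abstract reports (5 KB vs. 136 KB on Gemma 3-4B) follows from this trade: a single d_model-sized checkpoint replaces the full per-layer, per-head key and value tensors, at the cost of recomputing the projections when they are needed.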
Problem

Research questions and friction points this paper is trying to address.

KV cache
Transformer inference
residual stream
redundancy
memory efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache redundancy
residual stream
bit-identical recomputation
memory-efficient inference
KV-Direct
Kaleem Ullah Qasim
School of Computing and Artificial Intelligence, Southwest Jiaotong University
Reasoning in LLMs · Prompt Engineering · LLM Agents
Jiashu Zhang
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China
Muhammad Kafeel Shaheen
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China
Razan Alharith
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China
Heying Zhang
School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, 611756, China