🤖 AI Summary
This study challenges the long-standing assumption that KV-cached inference at FP16 precision is numerically equivalent to cache-free autoregressive inference. Using FP32-controlled experiments, layer-wise drift analysis, residual stream activation patching, and evaluations across multiple models (LLaMA-2, Mistral, Gemma) and decoding strategies, the work establishes for the first time that the non-associativity of FP16 floating-point arithmetic is the root cause of systematic, rather than random, output divergence. Experiments show 100% token-level disagreement on GSM8K across all models and decoding settings under FP16, while FP32 eliminates token flips entirely. Notably, enabling KV caching yields higher accuracy in 8 of 9 configurations, which indicates that the discrepancy has a consistent direction. The work provides a mechanistic framework for understanding numerical instability in large language model inference and causally attributes the divergence to the stateful nature of KV caching itself.
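To make the non-associativity claim concrete, here is a minimal sketch that rounds every intermediate sum to half precision using only the Python standard library. The helper names `to_fp16` and `fp16_add` and the three operands are illustrative assumptions, not taken from the study, and the rounding model is a toy stand-in for a real FP16 kernel.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def fp16_add(a: float, b: float) -> float:
    """Add two values, then round the result to FP16 (toy model of an FP16 adder)."""
    return to_fp16(a + b)

# Illustrative operands; each is exactly representable in FP16.
# Between 2048 and 4096 the spacing of FP16 values is 2, so 2049 cannot be stored.
a, b, c = 2048.0, 1.0, 1.0

left_first = fp16_add(fp16_add(a, b), c)   # (a + b) + c: each +1 is rounded away
right_first = fp16_add(a, fp16_add(b, c))  # a + (b + c): the 1s combine to 2 first

print(left_first, right_first)  # 2048.0 2050.0
```

The same three numbers produce two different sums depending only on grouping, and the result is fully repeatable, which is the sense in which such divergence is deterministic rather than random.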
📝 Abstract
KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: the cache-ON and cache-OFF execution paths accumulate floating-point values in different orders, and because FP16 arithmetic is non-associative, the two paths deterministically decode different token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate under every sampling strategy. The divergence persists under greedy decoding, which rules out sampling randomness as a cause. Cache-ON also yields higher accuracy in 8 of 9 conditions, indicating that the direction of the divergence is systematic rather than random. A controlled FP32 falsification experiment reduces divergence by eight orders of magnitude and drops the token flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma's larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
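The link between accumulation order and a flipped greedy token, together with the FP32 control, can be sketched with a toy example. Everything here is a hypothetical illustration: the helpers `round_to`, `accumulate`, and `greedy_token`, the two-token "vocabulary", and the partial sums are assumptions chosen to expose the effect, not the paper's experimental setup or any real inference kernel.

```python
import struct

def round_to(x: float, fmt: str) -> float:
    """Round x to the format named by a struct code: 'e' is FP16, 'f' is FP32."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(terms, fmt):
    """Sum terms left to right, rounding to the target format after every addition."""
    total = 0.0
    for t in terms:
        total = round_to(total + t, fmt)
    return total

def greedy_token(logit_terms, fmt, reverse=False):
    """Return the index of the largest logit under a given accumulation order."""
    logits = [
        accumulate(terms[::-1] if reverse else terms, fmt)
        for terms in logit_terms
    ]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical partial sums per candidate token; all values are exact in FP16.
logit_terms = [
    [1024.0, 0.5, 0.5, 0.5, 0.5],  # token 0: exact sum is 1026
    [1025.0],                      # token 1: exact sum is 1025
]

fp16_forward = greedy_token(logit_terms, "e")                # large term first
fp16_reverse = greedy_token(logit_terms, "e", reverse=True)  # small terms first
fp32_forward = greedy_token(logit_terms, "f")
fp32_reverse = greedy_token(logit_terms, "f", reverse=True)

print("FP16:", fp16_forward, fp16_reverse)  # FP16: 1 0
print("FP32:", fp32_forward, fp32_reverse)  # FP32: 0 0
```

In FP16 the forward order loses each 0.5 to rounding, so token 1 wins, while the reverse order preserves them and token 0 wins. In FP32 both orders agree, mirroring in miniature the abstract's observation that raising precision removes token flips.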