LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the GPU memory bottleneck caused by KV cache explosion in long chain-of-thought (CoT) reasoning, this work first identifies and formalizes the “token importance recurrence” phenomenon—critical tokens periodically regain high attention weights across multiple decoding steps. We propose a lag-based KV eviction mechanism: a sliding observation window dynamically tracks token recurrence intervals, and tokens are retained according to a maximum-recurrence-interval-first policy to preserve periodically important tokens. Our method requires no model architecture modifications or retraining. On mathematical and programming reasoning benchmarks, it reduces KV cache usage by 50% while matching the accuracy of full-cache baselines—and substantially outperforms existing state-of-the-art compression approaches. The core contribution lies in revealing the temporal periodicity of attention patterns in CoT reasoning and establishing the first recurrence-aware KV cache management framework grounded in empirical recurrence dynamics.

📝 Abstract
Large Language Models (LLMs) exhibit enhanced reasoning capabilities by employing Chain-of-Thought (CoT). However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache size, particularly in tasks requiring long reasoning sequences, such as mathematics and programming. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens receive renewed attention after multiple decoding steps, which existing works fail to capture and which may lead to unpredictable eviction of such periodically critical tokens. To address this, we propose LazyEviction, a lagged KV eviction framework designed to maintain reasoning performance while reducing KV memory. LazyEviction is an observation-window-based lagged eviction mechanism that retains latent recurring tokens by performing lagged evictions across decoding steps, and it contains two key components: (1) Recurrence Interval Tracking for capturing temporal variations in token importance, and (2) a Maximum Recurrence Interval-Centric Eviction Policy that prioritizes eviction based on tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache size by 50% while maintaining comparable accuracy on mathematics reasoning datasets, outperforming state-of-the-art methods. Our findings highlight the importance of preserving recurring tokens, which are critical for maintaining knowledge continuity in multi-step reasoning tasks.
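The Recurrence Interval Tracking component described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name `RecurrenceTracker`, the fixed attention threshold, and the per-token dictionaries are all assumptions introduced for clarity.

```python
from typing import Dict


class RecurrenceTracker:
    """Illustrative sketch: tracks, per cached token, the gap between
    decoding steps at which the token's attention weight exceeds a
    threshold (its "recurrence interval"). Hypothetical, not the
    paper's actual code."""

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.last_seen: Dict[int, int] = {}      # token index -> last high-attention step
        self.max_interval: Dict[int, int] = {}   # token index -> largest observed gap

    def update(self, step: int, attn_weights: Dict[int, float]) -> None:
        """Record which tokens receive renewed attention at this decoding step."""
        for tok, weight in attn_weights.items():
            if weight < self.threshold:
                continue
            if tok in self.last_seen:
                gap = step - self.last_seen[tok]
                self.max_interval[tok] = max(self.max_interval.get(tok, 0), gap)
            self.last_seen[tok] = step
```

For example, a token attended at steps 1, 4, and 6 would record a maximum recurrence interval of 3, marking it as periodically important rather than stale.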
Problem

Research questions and friction points this paper is trying to address.

Reduces GPU memory overhead in long reasoning tasks
Addresses unpredictable eviction of critical recurring tokens
Improves KV cache efficiency without losing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lagged KV eviction with attention pattern observation
Recurrence Interval Tracking for token importance
Maximum Recurrence Interval-Centric Eviction Policy
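The eviction policy above can be sketched under one plausible reading of "maximum-recurrence-interval-first": tokens inside the trailing observation window are exempt (the "lagged" part), and among the rest, tokens with the smallest observed maximum recurrence interval are evicted first, so periodically recurring tokens survive. The function name, the window exemption rule, and the tie-breaking are assumptions for illustration, not the paper's exact policy.

```python
from typing import Dict, List


def select_evictions(max_interval: Dict[int, int],
                     cache_tokens: List[int],
                     budget: int,
                     window: int,
                     current_step: int) -> List[int]:
    """Illustrative sketch: return token indices to evict so the KV cache
    fits `budget`. Tokens inside the trailing observation window are kept
    (lagged eviction); among the remainder, tokens with the smallest
    recorded maximum recurrence interval are evicted first."""
    # Recent tokens are still under observation and cannot be evicted yet.
    protected = {t for t in cache_tokens if current_step - t < window}
    candidates = [t for t in cache_tokens if t not in protected]
    # Smallest max interval first: these tokens are least likely to recur.
    candidates.sort(key=lambda t: max_interval.get(t, 0))
    n_evict = max(0, len(cache_tokens) - budget)
    return candidates[:n_evict]
```

With a cache of six tokens, a budget of four, and a window of two steps, the two non-protected tokens with the smallest recorded intervals would be evicted, while recent tokens and strongly recurring ones are retained.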