🤖 AI Summary
This work addresses the rapid growth of the KV cache in large language models during long-chain reasoning. Existing decoding-stage compression methods often rely on static, uniform budget allocation and therefore fail to adapt to the changing context demands of generation. The authors propose ReasonAlloc, a training-free, hierarchical KV cache budget allocation framework that, for the first time, formulates decoding-stage cache compression as a hierarchical resource allocation problem. Building on an architecture-driven demand pattern the authors call the "reasoning wave", ReasonAlloc combines offline layer-level pre-allocation with online head-level dynamic reallocation, improving cache efficiency with negligible additional overhead. Experiments on the mathematical reasoning benchmarks MATH-500 and AIME 2024 show that ReasonAlloc outperforms R-KV, SnapKV, and Pyramid-RKV, with the largest improvements under tight cache budgets of 128–512 tokens.
📝 Abstract
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128–512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
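To make the two-level idea concrete, the sketch below shows one way a fixed global KV budget could be split first across layers and then across heads. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not specify how layer weights or head utilities are computed, so `layer_weights` and `head_utilities` are hypothetical inputs, the function names are invented for this example, and the proportional split with largest-remainder rounding is a stand-in technique chosen only because it preserves the total budget exactly.

```python
from typing import List


def allocate_proportional(total: int, weights: List[float], floor: int = 1) -> List[int]:
    """Split `total` integer slots across items in proportion to `weights`.

    Every item receives at least `floor` slots, and the result sums to `total`
    (largest-remainder rounding). This helper is hypothetical, not from the paper.
    """
    n = len(weights)
    if total < floor * n:
        raise ValueError("total budget is too small for the requested floor")
    spare = total - floor * n
    weight_sum = sum(weights)
    if weight_sum <= 0:
        shares = [spare / n] * n  # fall back to a uniform split
    else:
        shares = [spare * w / weight_sum for w in weights]
    alloc = [floor + int(s) for s in shares]
    # Hand the slots lost to rounding to the items with the largest fractions.
    leftover = total - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


def hierarchical_budgets(
    avg_budget: int,
    layer_weights: List[float],
    head_utilities: List[List[float]],
) -> List[List[int]]:
    """Return per-layer, per-head token budgets with the same total as a
    uniform budget of `avg_budget` tokens for every head.

    layer_weights:  assumed offline profile (one weight per layer).
    head_utilities: assumed online scores (one list of head scores per layer).
    """
    num_layers = len(layer_weights)
    heads_per_layer = len(head_utilities[0])
    total = avg_budget * num_layers * heads_per_layer

    # Level 1 (offline in the paper's framing): fix each layer's share.
    layer_totals = allocate_proportional(total, layer_weights, floor=heads_per_layer)

    # Level 2 (online in the paper's framing): split each layer's share among
    # its heads according to the current utility estimate.
    return [
        allocate_proportional(layer_totals[layer], head_utilities[layer], floor=1)
        for layer in range(num_layers)
    ]


# Toy configuration: 4 layers, 2 heads per layer, average budget of 128 tokens.
example_budgets = hierarchical_budgets(
    avg_budget=128,
    layer_weights=[1.0, 3.0, 2.0, 2.0],  # made-up, non-monotonic profile
    head_utilities=[[1.0, 1.0], [3.0, 1.0], [1.0, 1.0], [0.5, 1.5]],
)
```

In a real system the head-level step would be re-run periodically during decoding as utilities change, and the resulting budgets would be passed to whichever token-eviction policy is in use; both of those pieces are outside this sketch.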