Hold Onto That Thought: Assessing KV Cache Compression On Reasoning

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates KV cache compression for multi-step reasoning tasks with short prompts and long decoding (e.g., GSM8K, MATH500), focusing on its impact on answer quality and reasoning-chain length. Unlike most prior evaluations, which target the prefill phase, it concentrates specifically on *decoding-phase* compression, proposing a decoding-enabled variant of SnapKV and benchmarking it against major existing strategies. On reasoning models, hit-rate-driven methods (H2O and the decoding-enabled SnapKV) dominate prefill-oriented approaches, indicating the utility of heavy-hitter tracking for reasoning traces. Surprisingly, cache budget and reasoning-chain length can be negatively correlated: some eviction strategies generate *longer* reasoning traces under lower cache budgets, trading cache size against inference cost. Experiments on the non-reasoning Llama-3.1-8B-Instruct show strong task dependency, with no universally optimal compression strategy. Key contributions: (1) establishing the importance of decoding-phase KV compression; (2) introducing a lightweight, decoding-aware SnapKV variant; and (3) characterizing the cache-budget versus reasoning-length tradeoff.
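For intuition, the sketch below shows what hit-rate-driven eviction of the H2O flavor looks like during decoding: each cached token accumulates the attention mass it receives, and when the cache exceeds a fixed budget the lowest-scoring old tokens are evicted while a recency window is protected. Everything here (tensor shapes, the `budget` and `recent` parameters, per-head handling) is an illustrative assumption, not the paper's exact implementation.

```python
import torch

def h2o_evict(keys, values, scores, attn_weights, budget, recent=32):
    """One decoding step of a simplified H2O-style heavy-hitter policy.

    keys, values: (seq_len, d) cached tensors for a single head
    scores:       (seq_len,) running sum of attention each token has received
    attn_weights: (seq_len,) attention from the newest query over the cache
    budget:       max tokens to keep; the `recent` newest are never evicted
    (Shapes and parameter names are assumptions for illustration only.)
    """
    scores = scores + attn_weights            # accumulate per-token "hit rate"
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values, scores

    # Protect the recency window; rank older tokens by accumulated attention
    # mass and keep only the heaviest hitters (assumes budget > recent).
    old = seq_len - recent
    keep_old = torch.topk(scores[:old], budget - recent).indices.sort().values
    keep = torch.cat([keep_old, torch.arange(old, seq_len)])
    return keys[keep], values[keep], scores[keep]
```

Called once per generated token, this keeps the cache at a constant `budget` entries, which is what makes memory flat even as the reasoning trace grows.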

📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks. For the non-reasoning Llama-3.1-8B-Instruct, we determine that no singular strategy fits all, and that performance is heavily influenced by dataset type. However, we discover that H2O and our decoding-enabled variant of SnapKV are dominant strategies for reasoning models, indicating the utility of heavy-hitter tracking for reasoning traces. We also find that eviction strategies at low budgets can produce longer reasoning traces, revealing a tradeoff between cache size and inference costs.
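To make "grows linearly with context length" concrete, here is a back-of-envelope calculation using Llama-3.1-8B's published GQA configuration (32 layers, 8 KV heads, head dim 128) in fp16; the arithmetic, rather than the exact numbers, is the point:

```python
# KV cache footprint per token: K and V, every layer, every KV head.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token)                   # 131072 bytes = 128 KiB per token
print(per_token * 8192 / 2**30)    # 1.0 GiB for an 8k-token sequence
```

A reasoning trace thousands of tokens long thus carries a cache on the order of a gigabyte per sequence, which is what makes decoding-phase eviction attractive.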
Problem

Research questions and friction points this paper is trying to address.

Assess KV cache compression on reasoning tasks
Evaluate compression strategies for long decoding sequences
Analyze tradeoff between cache size and inference costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking of popular KV cache compression strategies on long-reasoning tasks
Heavy-hitter tracking for reasoning traces (see the sketch after this list)
Identification of a cache-size vs. inference-cost tradeoff at low budgets
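As a sketch of what a decoding-enabled SnapKV-style selector might look like, the snippet below scores cache positions by the pooled attention of a trailing observation window and keeps the top-`budget` entries. In the original SnapKV this selection runs once after prefill; a decoding-time variant would presumably re-run it periodically as the trace grows. The window size, pooling kernel, and refresh schedule are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def snapkv_select(attn, budget, window=16, kernel=5):
    """SnapKV-style selection, sketched for a decoding-time refresh.

    attn:   (window, seq_len) attention from the last `window` queries
            (the "observation window") over the full cache, one head.
    Returns indices of `budget` cache positions to keep
    (assumes budget > window; all sizes are illustrative).
    """
    votes = attn.sum(dim=0)                  # aggregate window attention
    # Smooth votes so neighbors of heavy tokens also survive (pooling step).
    votes = F.max_pool1d(votes.view(1, 1, -1), kernel_size=kernel,
                         stride=1, padding=kernel // 2).view(-1)
    n_ctx = votes.shape[0] - window          # always keep the window itself
    top = torch.topk(votes[:n_ctx], budget - window).indices
    return torch.cat([top.sort().values,
                      torch.arange(n_ctx, votes.shape[0])])
```

Indexing the cached keys and values with the returned indices (e.g., `keys[keep], values[keep]`) then shrinks the cache to the budget in one step.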
👥 Authors
Minghui Liu (University of Maryland)
Aadi Palnitkar (University of Maryland)
Tahseen Rabbani (Postdoctoral Scholar, University of Chicago; machine learning, privacy, efficiency)
Hyunwoo Jae (University of Maryland)
Kyle Rui Sang (University of Maryland)
Dixi Yao (University of Chicago)
Shayan Shabihi (University of Maryland)
Fuheng Zhao (Snowflake; Databases, Distributed Systems)
Tian Li (University of Chicago)
Ce Zhang (University of Chicago)
Furong Huang (Associate Professor of Computer Science, University of Maryland; Trustworthy AI/ML, Reinforcement Learning, Generative AI)
Kunpeng Zhang (HKUST; Fuzzing, Software Testing)