ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models exhaust GPU memory during long-output generation because extended chain-of-thought (CoT) decoding drives rapid KV cache growth. To address this, the authors propose ThinKV, a thought-adaptive KV cache compression framework that uses attention sparsity to estimate token importance across distinct reasoning stages, enabling dynamic, joint quantization and eviction: tokens from critical thoughts are kept at high precision, while less important ones undergo progressive precision reduction and eventual eviction. A PagedAttention kernel extension further supports fine-grained reuse of evicted tokens' memory slots, eliminating cache-defragmentation overhead. On mathematical and programming benchmarks, ThinKV shrinks the KV cache to under 5% of the baseline footprint while preserving near-lossless accuracy and delivering up to 5.8× higher inference throughput.

📝 Abstract
The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while delivering up to 5.8x higher inference throughput than state-of-the-art baselines.
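The compression idea in the abstract can be sketched as follows: score each cached token by the attention mass it receives, then map that score to a precision tier or to eviction. The mean-attention scoring rule, the thresholds, and the two-tier fp16/int4 scheme below are illustrative assumptions, not ThinKV's actual thought-level policy.

```python
import numpy as np

def assign_precision(attn_weights, hi_thresh=0.05, lo_thresh=0.005):
    """Toy illustration of importance-driven mixed precision.

    attn_weights: (num_queries, num_tokens) attention matrix.
    Tokens receiving a large share of attention mass stay at high
    precision; mid-importance tokens are quantized to a lower
    bit-width; the rest are marked for eviction. (Thresholds and
    the scoring rule are illustrative, not ThinKV's algorithm.)
    """
    # Importance = average attention each cached token receives.
    importance = attn_weights.mean(axis=0)
    plan = np.full(importance.shape, "evict", dtype=object)
    plan[importance >= lo_thresh] = "int4"   # reduced precision
    plan[importance >= hi_thresh] = "fp16"   # keep full precision
    return plan

# Example: 4 query positions attending over 6 cached tokens.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(6), size=4)   # each row sums to 1
print(assign_precision(attn))
```

In the paper's framing the tiers would be assigned per thought rather than per token, with precision degrading progressively as the reasoning trajectory moves on.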
Problem

Research questions and friction points this paper is trying to address.

Extended chain-of-thought generation drives rapid KV cache growth that quickly overwhelms GPU memory
Uniform compression ignores that thoughts within the CoT vary in importance
Evicting cached tokens fragments memory, and compacting it adds overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid quantization-eviction strategy for KV cache compression
Thought-adaptive token precision assignment by importance
PagedAttention kernel extension enabling efficient memory reuse
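The memory-reuse idea behind the kernel extension can be illustrated with a free-list allocator: an evicted token's physical slot returns to a pool and is handed to the next incoming token, so no compaction pass is needed. The class and its API below are hypothetical; the actual ThinKV kernel operates on PagedAttention's GPU block tables.

```python
class SlotAllocator:
    """Minimal free-list allocator sketching in-place slot reuse.

    Evicted slots go straight onto a free list and are recycled by
    the next allocation, avoiding any defragmentation/compaction
    step. (A CPU-side sketch of the idea only.)
    """
    def __init__(self, num_slots):
        self.free = list(range(num_slots))   # all slots start free

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache full")
        return self.free.pop()

    def evict(self, slot):
        self.free.append(slot)   # slot is immediately reusable

alloc = SlotAllocator(4)
a = alloc.alloc()
b = alloc.alloc()
alloc.evict(a)        # a's physical slot returns to the pool
c = alloc.alloc()     # reuses the evicted slot; no compaction
```

Reusing slots in place keeps the logical-to-physical mapping sparse but stable, which is what lets eviction proceed without moving live entries.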