LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high GPU memory consumption and slow decoding in large language models (LLMs) during long-context inference—caused by KV cache growth—this paper proposes a learnable, channel-level key cache pruning method. The approach introduces trainable static channel masks and a two-stage training strategy to achieve fine-grained sparsity while preserving hardware alignment. It also analyzes the importance distributions of attention heads and channels over long contexts. A customized sparse attention decoding kernel is developed to accelerate computation. Experiments demonstrate that the method reduces key cache usage by up to 70% and value cache by 16–18% without sacrificing accuracy, and attention computation is accelerated by 1.3×, significantly improving long-context inference efficiency.
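The core idea of channel-level K cache pruning can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mask here is hard-coded, whereas LeanK learns it via its two-stage training; the function names are hypothetical.

```python
import numpy as np

def prune_k_cache(k_cache, channel_mask):
    """Keep only the important K channels using a static per-channel mask.

    k_cache: (seq_len, head_dim) key cache for one attention head.
    channel_mask: (head_dim,) boolean mask (learned offline in LeanK;
                  hard-coded below for illustration).
    """
    return k_cache[:, channel_mask]

def attention_scores(query, pruned_k, channel_mask, head_dim):
    # The query is sliced with the same static mask so dot products align;
    # scaling still uses the original head_dim.
    q = query[channel_mask]
    return pruned_k @ q / np.sqrt(head_dim)

rng = np.random.default_rng(0)
head_dim, seq_len = 8, 5
mask = np.array([True, True, False, True, False, False, True, False])
k = rng.standard_normal((seq_len, head_dim))
pk = prune_k_cache(k, mask)
scores = attention_scores(rng.standard_normal(head_dim), pk, mask, head_dim)
print(pk.shape)      # cache stores only the kept channels
print(scores.shape)  # one score per cached position, as in full attention
```

Because the mask is static (fixed after training), the pruned cache stays contiguous in memory, which is what makes a hardware-aligned sparse decoding kernel practical.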

📝 Abstract
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a target sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%–18% V cache memory reduction. A custom decoding kernel enables a 1.3× speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache memory in LLMs for efficiency
Prunes unimportant K cache channels via learnable masks
Accelerates decoding without accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning-based K cache channel pruning
Two-stage training for static masks
Custom kernel speeds up attention computation
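The second stage of the training strategy, turning learned importance into a static, hardware-aligned mask, can be sketched like this. All names and the alignment granularity are illustrative assumptions, not the paper's API; LeanK's first stage learns the soft gates end-to-end, which is omitted here.

```python
import numpy as np

def harden_mask(soft_gates, sparsity, align=8):
    """Stage-2 sketch: convert learned soft channel gates into a static mask.

    Keeps the highest-scoring channels at the target sparsity ratio,
    rounding the kept count down to a multiple of `align` so the pruned
    cache stays hardware-friendly. Names are hypothetical.
    """
    d = soft_gates.shape[0]
    keep = int(d * (1.0 - sparsity))
    keep -= keep % align            # enforce hardware alignment
    keep = max(keep, align)         # never prune every channel
    top = np.argsort(soft_gates)[::-1][:keep]
    mask = np.zeros(d, dtype=bool)
    mask[top] = True
    return mask

gates = np.random.default_rng(1).random(128)   # stand-in for learned gates
mask = harden_mask(gates, sparsity=0.7, align=8)
print(int(mask.sum()))  # 32 of 128 channels kept at ~70% sparsity
```

Freezing the mask after this step is what distinguishes the approach from dynamic per-token selection: the kept channel set is identical at every decoding step, so memory layout and kernel tiling can be fixed ahead of time.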