STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

📅 2026-06-06
🤖 AI Summary
Existing low-rank key-value (KV) cache compression methods rely on fixed or heuristic rank selection and lose accuracy at high compression ratios. The authors propose STAR-KV, a framework that uses a differentiable soft-thresholding mechanism to control rank adaptively at both the attention-head and block levels. It combines this with a hybrid low-rank decomposition chosen according to the sensitivity of the key and value projections, a low-rank-aware mixed-precision quantization scheme that uses data statistics to achieve near-lossless low-bit quantization, and custom Triton GPU kernels. Evaluated across multiple large language models, STAR-KV achieves up to 75% KV cache compression, up to 20× overall KV cache reduction when combined with quantization, up to 6.9× speedup in the attention module, and up to 3.1× higher end-to-end generation throughput.
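To make the relationship between the two headline storage figures concrete, the arithmetic below is a back-of-the-envelope sketch, not a result from the paper. It assumes a 16-bit baseline cache, assumes the low-rank and quantization savings multiply, ignores metadata overhead such as scales and zero points, and assumes both "up to" figures come from the same configuration. The function names are hypothetical.

```python
def low_rank_factor(compression_ratio):
    """Reduction factor from removing a fraction of the cache (0.75 -> 4x)."""
    return 1.0 / (1.0 - compression_ratio)


def required_avg_bits(target_factor, compression_ratio, baseline_bits=16):
    """Average bit width needed to reach target_factor overall.

    Assumption: total reduction = low-rank factor * (baseline_bits / avg_bits).
    """
    quant_factor = target_factor / low_rank_factor(compression_ratio)
    return baseline_bits / quant_factor


rank_only = low_rank_factor(0.75)        # 75% compression alone
avg_bits = required_avg_bits(20, 0.75)   # bits per element implied by 20x
print(f"low-rank only: {rank_only:.1f}x, implied average bits: {avg_bits:.2f}")
```

Under these assumptions, 75% compression alone is a 4× reduction, so quantization would have to contribute roughly another 5×, which corresponds to a little over 3 bits per stored element on average.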
📝 Abstract
Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed-precision quantization scheme that leverages data statistics for near-lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and up to 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.
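The idea of selecting rank through thresholding can be illustrated with the standard soft-thresholding operator, max(s - tau, 0), applied to singular values. The sketch below is only a minimal illustration of that operator, not the STAR-KV implementation. The singular values, the threshold `tau`, and the function names are hypothetical. In a trainable setting `tau` would be a learned parameter, and the operator is differentiable almost everywhere, which is what would let gradients shape the rank.

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clip at zero: max(s - tau, 0)."""
    return [max(s - tau, 0.0) for s in singular_values]


def effective_rank(shrunk_values):
    """Count the singular values that survive thresholding."""
    return sum(1 for s in shrunk_values if s > 0.0)


# Hypothetical singular values of a projection matrix for two attention heads.
heads = {
    "head_0": [9.0, 4.0, 1.5, 0.4, 0.1],  # fast spectral decay
    "head_1": [6.0, 5.5, 5.0, 4.5, 0.2],  # flatter spectrum
}
tau = 1.0  # assumed threshold, fixed here for illustration

ranks = {name: effective_rank(soft_threshold(s, tau)) for name, s in heads.items()}
print(ranks)  # {'head_0': 3, 'head_1': 4}
```

With the same threshold, the head whose spectrum decays quickly keeps fewer components than the head with a flatter spectrum, which is the sense in which a threshold gives per-head rather than fixed rank.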
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
low-rank projection
adaptive rank control
attention mechanism
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-rank compression
adaptive rank selection
KV cache
mixed-precision quantization
attention acceleration