PatternKV: Flattening KV Representation Expands Quantization Headroom

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory and bandwidth bottlenecks caused by KV caching in autoregressive large language model (LLM) inference, particularly under long-context and test-time scaling scenarios where low-bit quantization severely degrades accuracy, this paper proposes pattern-aligned residual quantization. The method dynamically identifies structured pattern vectors within the KV cache online, aligns each KV vector to its nearest pattern, and quantizes only the residual, combined with distribution reshaping to mitigate distribution shift under ultra-low-bit quantization. Unlike conventional outlier-isolation approaches, this yields flatter, more compact quantized distributions. Experiments across multiple backbone models show that 4-bit quantization incurs only a 0.08% accuracy drop, while 2-bit quantization even surpasses the full-precision baseline. Under test-time scaling, accuracy improves by 10%, throughput increases by 1.4×, and supported batch size grows by 25%.

📝 Abstract
KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
Problem

Research questions and friction points this paper is trying to address.

KV cache causes memory bottleneck in LLM inference
Native KV distribution lacks flatness for quantization
Prior quantization methods fail under low-bit settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

PatternKV mines representative pattern vectors online
Aligns each KV vector to its nearest pattern
Quantizes only the residual for improved fidelity
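The three steps above can be sketched in a toy form. This is a minimal illustration of the pattern-aligned residual idea, not the paper's implementation: the pattern vectors here are given rather than mined online, the quantizer is plain per-tensor uniform quantization, and all names (`quantize_uniform`, `pattern_residual_quantize`) are hypothetical. It only shows why quantizing residuals against nearby patterns shrinks the quantization range relative to quantizing raw vectors.

```python
import numpy as np

def quantize_uniform(x, bits):
    # Per-tensor uniform quantization over [min, max]; returns dequantized values.
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def pattern_residual_quantize(kv, patterns, bits=2):
    # Align each KV vector to its nearest pattern (L2 distance),
    # quantize only the residual, then add the pattern back.
    dists = np.linalg.norm(kv[:, None, :] - patterns[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    residual = kv - patterns[idx]
    return patterns[idx] + quantize_uniform(residual, bits), idx

rng = np.random.default_rng(0)
patterns = rng.normal(size=(4, 8))  # stand-in for "mined" pattern vectors
# Toy KV cache: each vector is a pattern plus small noise.
kv = patterns[rng.integers(0, 4, size=64)] + 0.05 * rng.normal(size=(64, 8))

deq, idx = pattern_residual_quantize(kv, patterns, bits=2)
direct = quantize_uniform(kv, bits=2)
err_pattern = np.abs(deq - kv).mean()
err_direct = np.abs(direct - kv).mean()
print(err_pattern < err_direct)  # → True
```

Because the residuals occupy a much narrower, flatter range than the raw KV vectors, the same 2-bit budget yields far finer quantization steps, which is the headroom the title refers to.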