Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

📅 2026-05-31

🤖 AI Summary

This work shows that low-bit quantization of key-value (KV) caches, while reducing memory consumption during large language model inference, can silently degrade safety alignment, a failure mode that perplexity- and accuracy-based evaluations miss. The authors trace this alignment collapse to quantization noise perturbing a low-dimensional activation subspace that is far more sensitive than the full representation space. They propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes, paired with a training-free repair protocol that needs only 20 calibration prompts. Evaluated across 11 models and 5 benchmarks, the protocol recovers up to 97% of lost alignment (up to 97.2% with KIVI) at a cost of approximately 35 GPU-minutes and minimal memory overhead. It works with production quantizers such as KIVI, and the underlying vulnerability is confirmed in production vLLM serving with an FP8 KV cache.

📝 Abstract

Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations focus solely on perplexity and accuracy and do not assess the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace that is 10^2-10^3x more vulnerable to quantization noise than the full representation space that perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
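The "outlier-crushes-safety" mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the authors' PCR diagnostic; it only shows, under assumed synthetic data, how a single outlier channel inflates a shared quantization scale and wipes out small-magnitude channels, while the error averaged over the whole tensor still looks moderate. The function names, the 4-bit symmetric absmax quantizer, the tensor shape, and the 100x outlier magnitude are all hypothetical choices made for illustration.

```python
import numpy as np

def quantize_dequantize(x, bits, axis=None):
    """Symmetric absmax quantization followed by dequantization.

    axis=None uses one scale for the whole tensor;
    axis=0 uses one scale per channel (column).
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(absmax > 0, absmax / qmax, 1.0)
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def rel_error(x, xq, cols):
    """Relative L2 reconstruction error restricted to the given channels."""
    return np.linalg.norm(x[:, cols] - xq[:, cols]) / np.linalg.norm(x[:, cols])

# Synthetic "cache" tensor: 256 tokens x 64 channels (assumed sizes).
rng = np.random.default_rng(0)
tokens, channels = 256, 64
x = rng.normal(0.0, 1.0, size=(tokens, channels))
x[:, 0] *= 100.0  # hypothetical outlier channel

all_cols = np.arange(channels)
small_cols = np.arange(1, channels)  # the non-outlier channels

per_tensor = quantize_dequantize(x, bits=4, axis=None)
per_channel = quantize_dequantize(x, bits=4, axis=0)

# Error seen by an aggregate metric vs. error inside the small channels.
err_tensor_all = rel_error(x, per_tensor, all_cols)
err_tensor_small = rel_error(x, per_tensor, small_cols)
err_channel_small = rel_error(x, per_channel, small_cols)

print(f"per-tensor,  all channels:   {err_tensor_all:.3f}")
print(f"per-tensor,  small channels: {err_tensor_small:.3f}")
print(f"per-channel, small channels: {err_channel_small:.3f}")
```

With a shared scale, every value in the non-outlier channels rounds to zero, so information carried there is lost even though the tensor-wide error stays small. Per-channel scales avoid this in the toy setting, which corresponds to the case where finer granularity helps. It would not help in the "outlier-as-safety" case, where the relevant signal sits in the outlier channels themselves.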

Problem

Research questions and friction points this paper is trying to address.

KV cache quantization

alignment collapse

safety preservation

large language models

quantization noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache quantization

alignment collapse

Per-Channel Reduction (PCR)