AI Summary
This work addresses the instability of low-bit (4-bit) activation quantization under the dynamically varying residual stream across layers, which often leads to quantization collapse, loss of historical information, and degraded model accuracy. The study is the first to uncover the phase-wise emergence of large-magnitude activations in the residual stream and their disruptive effect on quantization scales. To mitigate this, the authors propose a phase-aware dynamic mixed-precision activation quantization strategy that adaptively employs 8-bit precision in sensitive layers while retaining 4-bit elsewhere. Two novel metrics, Jump Ratio and Historical Feature SNR, are introduced to guide precision allocation. The resulting framework seamlessly integrates with state-of-the-art post-training quantization methods such as QuaRot and SpinQuant. Experiments demonstrate significant improvements in perplexity and zero-shot question answering for LLaMA-2/3 under W4A4KV4 settings, achieving 1.05–1.07× higher inference throughput with minimal memory overhead.
Abstract
Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce two metrics, Jump Ratio and Historical Feature SNR. The analysis suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving a 1.05× to 1.07× throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
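The mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the DynamicPTQ implementation: the abstract does not define Jump Ratio or Historical Feature SNR, so `jump_ratio` and `historical_snr_db` are assumed stand-in definitions, and `fake_quantize`, `choose_bits`, the threshold of 1.0, and the synthetic data are all hypothetical. It only shows how a single massive activation can set the 4-bit quantization scale and wash out the accumulated residual, and how switching that layer to 8-bit recovers it.

```python
import math
import random


def fake_quantize(values, bits):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax  # the largest magnitude sets the step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def jump_ratio(history, update):
    """ASSUMED definition: size of the new update relative to the accumulated residual."""
    return l2_norm(update) / l2_norm(history)


def historical_snr_db(history, update, bits):
    """ASSUMED definition: power of the historical residual over the power of the
    quantization error of (history + update), in decibels."""
    residual = [h + u for h, u in zip(history, update)]
    dequantized = fake_quantize(residual, bits)
    noise_power = sum((q - r) ** 2 for q, r in zip(dequantized, residual))
    signal_power = sum(h * h for h in history)
    if noise_power == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal_power / noise_power)


def choose_bits(history, update, threshold=1.0):
    """Hypothetical policy: 8-bit activations when the residual jumps, else 4-bit."""
    return 8 if jump_ratio(history, update) > threshold else 4


def run_demo(seed=0, width=256):
    rng = random.Random(seed)
    # Synthetic residual stream accumulated by earlier layers.
    history = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # A "calm" layer adds a small update; a "spike" layer adds one massive activation.
    calm = [rng.gauss(0.0, 0.1) for _ in range(width)]
    spike = list(calm)
    spike[0] = 40.0
    report = {}
    for name, update in (("calm", calm), ("spike", spike)):
        report[name] = {
            "jump_ratio": jump_ratio(history, update),
            "snr_a4_db": historical_snr_db(history, update, 4),
            "snr_a8_db": historical_snr_db(history, update, 8),
            "bits": choose_bits(history, update),
        }
    return report


if __name__ == "__main__":
    for name, row in run_demo().items():
        print(name, {k: round(v, 2) for k, v in row.items()})
```

In this toy setting, the spike layer's 4-bit step size becomes larger than almost every historical value, so nearly all of the history rounds to zero. The same layer at 8-bit keeps a step size small enough to preserve it, which is the intuition behind assigning higher precision only where the residual stream changes abruptly.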