OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the severe accuracy degradation that large language models suffer under low-bit quantization because of activation outliers. The authors propose OffQ, an outlier concentration method based on top-1 PCA and rotation: it first identifies the dominant outlier subspace, then applies a rotation that concentrates high-magnitude activations into a single channel, and finally models that channel as a shared offset, which reduces the standard deviation of the activations. Representing the outlier magnitude as a shared offset allows W4A4KV4 quantization on a uniform grid with uniform precision. The reported experiments show OffQ outperforming existing low-bit quantization baselines across multiple large language models and benchmarks, improving accuracy while preserving low-bit efficiency.

📝 Abstract

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into a single channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.
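
The three steps named in the abstract (top-1 PCA, rotation into a single channel, conversion to a shared offset) can be illustrated with a small NumPy sketch. This is one possible reading of the abstract, not the authors' implementation. The helper names (`top1_direction`, `householder_to`, `quantize_uniform`, `demo`), the use of Householder reflections as the orthogonal transforms, the per-token mean as the shared offset, the full-precision offset, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def top1_direction(x):
    """Leading right singular vector of x (an uncentered top-1 PCA direction)."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def householder_to(u, target):
    """Orthogonal matrix H with H @ u == target, for unit vectors u and target.

    H is a Householder reflection: symmetric and its own inverse.
    """
    v = u - target
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.eye(u.size)
    v = v / norm
    return np.eye(u.size) - 2.0 * np.outer(v, v)

def quantize_uniform(x, bits=4):
    """Symmetric per-token uniform-grid quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def demo(seed=0, tokens=256, dim=64):
    rng = np.random.default_rng(seed)
    # Synthetic activations (assumption): unit noise plus a large component
    # along one fixed direction dominated by two channels.
    direction = np.zeros(dim)
    direction[3], direction[17] = 1.0, 0.5
    direction /= np.linalg.norm(direction)
    gain = 50.0 * rng.uniform(0.5, 1.5, size=(tokens, 1))
    x = rng.standard_normal((tokens, dim)) + gain * direction

    # Step 1: find the dominant outlier direction.
    u = top1_direction(x)
    # Step 2: map that direction onto channel 0, concentrating the outlier.
    e0 = np.zeros(dim)
    e0[0] = 1.0
    r1 = householder_to(u, e0)
    # Step 3 (assumed): map channel 0 onto the uniform direction, so the
    # outlier magnitude shows up as the same value in every channel.
    ones = np.full(dim, 1.0 / np.sqrt(dim))
    r2 = householder_to(e0, ones)

    xr = x @ r1 @ r2
    offset = xr.mean(axis=1, keepdims=True)  # shared per-token offset, kept in float
    xq = quantize_uniform(xr - offset) + offset
    recon = xq @ r2 @ r1  # both reflections are self-inverse

    err_plain = np.mean((quantize_uniform(x) - x) ** 2)
    err_offset = np.mean((recon - x) ** 2)
    return err_plain, err_offset

if __name__ == "__main__":
    err_plain, err_offset = demo()
    print(f"4-bit MSE without offset: {err_plain:.4f}")
    print(f"4-bit MSE with offset:    {err_offset:.4f}")
```

On this synthetic data, subtracting the offset leaves only the low-variance residual for the 4-bit grid to cover, which is the effect the abstract describes as reducing the standard deviation of the activations. How OffQ actually computes, stores, and folds the offset into the surrounding layers is not stated in the abstract and is not modeled here.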

Problem

Research questions and friction points this paper is trying to address.

LLM quantization

activation outliers

low-bit quantization

structured outliers

quantization degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

outlier mitigation

low-bit quantization

activation offsetting