Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing ultra-low-bit post-training quantization (PTQ) methods rely on rigid saliency assumptions or position heuristics, which incur substantial hidden overhead from scale factors. This work proposes SAGE-PTQ, a framework that separates salient from non-salient weights using distributional statistics, then models subsampled non-salient weights as a sparse graph to estimate the optimal number of groups per layer. An adaptive saliency threshold selects the saliency ratio for each matrix. Quantization is dual-mode: salient weights keep multi-bit precision with one scale per channel, while non-salient weights are binarized with one scalar per group. SAGE-PTQ averages 1.03 weight bits and only 0.004 scaling bits per matrix. On LLaMA-3-8B it reaches a WikiText2 perplexity of 6.74, compared with 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B it delivers 1.5× faster decoding on a single NVIDIA L40 GPU.
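To make the "hidden cost of scales" concrete, the sketch below amortizes scale storage over the weights of one matrix. It is an illustrative accounting only: the 4096×4096 shape, the FP16 scale width, the block size of 128, and the group count of 8 are all assumptions chosen for the example, and the function name `scale_bits_per_weight` is hypothetical. The source does not state how its 0.004 figure is computed.

```python
def scale_bits_per_weight(rows, cols, num_scales, bits_per_scale=16):
    """Amortized scale storage, in bits per weight, for one rows x cols matrix."""
    return num_scales * bits_per_scale / (rows * cols)


rows, cols = 4096, 4096  # hypothetical layer shape (assumption)

# Illustrative baseline: one FP16 scale for every block of 128 weights.
blockwise = scale_bits_per_weight(rows, cols, (rows * cols) // 128)

# Layout described in the abstract: one scale per channel for salient
# weights plus one scalar per non-salient group. The group count is assumed.
num_groups = 8
per_channel_plus_groups = scale_bits_per_weight(rows, cols, rows + num_groups)

print(f"block-wise scales:         {blockwise:.4f} bits/weight")
print(f"per-channel + group scale: {per_channel_plus_groups:.4f} bits/weight")
```

Under these assumptions the block-wise layout spends 0.125 bits per weight on scales alone, a sizeable fraction of a roughly 1-bit budget, while the per-channel plus per-group layout stays below 0.004 bits per weight.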
📝 Abstract
Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
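The dual-mode idea in the abstract can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the saliency rule (distance from the mean in units of standard deviation), the fixed 4-bit salient precision, and the magnitude-quantile grouping are placeholders, and the function name `dual_mode_quantize` is hypothetical. In particular, SAGE-PTQ estimates the group count from a sparse graph and picks the saliency ratio adaptively, neither of which is reproduced here, and the cost of storing group membership is ignored.

```python
import numpy as np


def dual_mode_quantize(W, k_sigma=3.0, salient_bits=4, num_groups=4):
    """Toy dual-mode quantizer. Returns (dequantized W, salient mask, group scalars)."""
    # Saliency split from a distribution statistic (assumed rule, not the paper's).
    mu, sigma = W.mean(), W.std()
    salient = np.abs(W - mu) > k_sigma * sigma

    # Salient weights: symmetric uniform quantization, one scale per row (channel).
    qmax = 2 ** (salient_bits - 1) - 1
    sal_vals = np.where(salient, W, 0.0)
    row_max = np.abs(sal_vals).max(axis=1, keepdims=True)
    scale = np.where(row_max > 0, row_max / qmax, 1.0)
    q = np.clip(np.round(sal_vals / scale), -qmax, qmax)
    W_hat = np.where(salient, q * scale, 0.0)

    # Non-salient weights: binarize to +/- alpha with one scalar per group.
    # Groups are magnitude quantiles here, purely as a placeholder.
    uns = ~salient
    mags = np.abs(W[uns])
    edges = np.quantile(mags, np.linspace(0.0, 1.0, num_groups + 1))
    group_id = np.clip(
        np.searchsorted(edges, mags, side="right") - 1, 0, num_groups - 1
    )
    # mean(|w|) is the least-squares optimal scalar for sign(w) * alpha.
    alphas = np.array(
        [mags[group_id == g].mean() if np.any(group_id == g) else 0.0
         for g in range(num_groups)]
    )
    W_hat[uns] = np.sign(W[uns]) * alphas[group_id]
    return W_hat, salient, alphas


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
W[0, 0] = 10.0  # inject one obvious outlier
W_hat, salient, alphas = dual_mode_quantize(W)
print("salient fraction:", salient.mean())
print("reconstruction MSE:", np.mean((W - W_hat) ** 2))
```

Raising `num_groups` lowers the reconstruction error of the binarized part but adds one scalar per group, which is the trade-off the per-layer group estimation is meant to settle.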
Problem

Research questions and friction points this paper is trying to address.

post-training quantization
ultra-low-bit quantization
scaling overhead
large language models
weight saliency
Innovation

Methods, ideas, or system contributions that make the work stand out.

ultra-low-bit quantization
graph-guided optimization
saliency-aware quantization
post-training quantization
scaling overhead reduction