Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing ultra-low-bit post-training quantization (PTQ) methods rely on rigid saliency assumptions or positional heuristics, and they incur substantial hidden overhead from storing scale factors. This work proposes SAGE-PTQ, a framework that separates salient from non-salient weights using distributional statistics, then models subsampled non-salient weights as a sparse graph to estimate the optimal number of groups per layer. It applies dual-mode quantization: salient weights keep multi-bit precision with one per-channel scale, while non-salient weights are binarized with one scalar scale per group. An adaptive saliency threshold then selects the saliency ratio for each matrix. SAGE-PTQ averages 1.03 weight bits and only 0.004 scaling bits per matrix. On LLaMA-3-8B it reaches a WikiText2 perplexity of 6.74, compared with 55.8 for BiLLM, while using less than half of BiLLM's GPU memory. On LLaMA-2-70B it delivers 1.5× faster decoding on a single NVIDIA L40 GPU.
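
To give a feel for the scale of the "hidden" cost, the sketch below amortizes scale storage over the weights of one matrix. It is a back-of-the-envelope calculation, not the paper's accounting: the 16-bit scale width, the 4096×4096 matrix shape, the group count of 8, and the reading of the 0.004 figure as scale bits per weight are all assumptions made here for illustration. The function name `scale_bits_per_weight` is hypothetical.

```python
def scale_bits_per_weight(rows, cols, n_groups, scale_bits=16):
    """Amortized scale storage, in bits per weight, for one weight matrix.

    Assumes one scale per output channel (row) for salient weights plus
    one scalar per non-salient group, each stored in `scale_bits` bits.
    """
    n_scales = rows + n_groups
    return n_scales * scale_bits / (rows * cols)


def blockwise_bits_per_weight(block_size, scale_bits=16):
    """Amortized cost when every block of `block_size` weights has its own scale."""
    return scale_bits / block_size


# Assumed example shape and group count; neither comes from the source.
overhead = scale_bits_per_weight(4096, 4096, n_groups=8)
blockwise = blockwise_bits_per_weight(128)
print(f"per-channel + per-group: {overhead:.4f} bits/weight")
print(f"one scale per 128 weights: {blockwise:.4f} bits/weight")
```

Under these assumptions the per-channel term dominates, so the total barely moves as the group count changes. A scheme that stores one scale per small block pays far more per weight, which is the overhead the summary describes as hidden.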

📝 Abstract

Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hidden scaling overhead. We propose SAGE-PTQ (Saliency-Aware Graph-guided Efficient PTQ), a novel ultra-low-bit quantization framework for LLMs that minimizes hidden scaling cost. SAGE-PTQ separates salient and unsalient weights using distributional statistics, then models subsampled unsalient weights as a sparse graph to estimate the optimal number of groups per layer. SAGE-PTQ applies dual-mode quantization, assigning multi-bit precision to salient weights and binarizing unsalient weights. To reduce scaling overhead, SAGE-PTQ uses one per-channel scale for salient weights and one scalar per unsalient group. Finally, SAGE-PTQ implements adaptive saliency thresholding to select the optimal saliency ratio per matrix. SAGE-PTQ achieves 1.03 weight bits and only 0.004 scaling bits per matrix on average, outperforming state-of-the-art methods such as BiLLM and PB-LLM. On LLaMA-3-8B, SAGE-PTQ achieves 6.74 WikiText2 perplexity, compared to 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B, SAGE-PTQ provides 1.5x faster decoding on one NVIDIA L40 GPU, demonstrating practical inference efficiency.
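
The dual-mode scheme described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the saliency rule (magnitude above mean + k·std), the magnitude-sorted grouping that stands in for the graph-guided group estimate, the 4-bit salient precision, and every function name below are hypothetical. The binarization step uses the standard result that, for w ≈ α·sign(w), the squared error is minimized by α = mean |w|.

```python
from statistics import fmean, pstdev


def split_salient(weights, k=2.0):
    """Mark weights whose magnitude exceeds mean + k * std of all magnitudes.

    Hypothetical rule: the source only says the split uses distributional statistics.
    """
    mags = [abs(w) for row in weights for w in row]
    threshold = fmean(mags) + k * pstdev(mags)
    return [[abs(w) > threshold for w in row] for row in weights]


def quantize_channel(values, bits=4):
    """Symmetric uniform quantization of one channel's salient weights, one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def dual_mode_quantize(weights, n_groups=2, bits=4, k=2.0):
    """Return (dequantized weights, salient mask, per-group binarization scales)."""
    mask = split_salient(weights, k)
    out = [list(row) for row in weights]

    # Salient weights: multi-bit, one scale per row (treated here as the channel).
    for r, row in enumerate(weights):
        cols = [c for c, is_salient in enumerate(mask[r]) if is_salient]
        if cols:
            for c, q in zip(cols, quantize_channel([row[c] for c in cols], bits)):
                out[r][c] = q

    # Non-salient weights: sort by magnitude and cut into at most n_groups buckets
    # (a stand-in for the graph-guided grouping, which the source does not detail).
    rest = [(r, c) for r, row in enumerate(mask) for c, s in enumerate(row) if not s]
    rest.sort(key=lambda rc: abs(weights[rc[0]][rc[1]]))
    size = max(1, -(-len(rest) // n_groups))  # ceiling division
    alphas = []
    for start in range(0, len(rest), size):
        group = rest[start:start + size]
        # alpha = mean |w| minimizes squared error for w ~ alpha * sign(w).
        alpha = fmean([abs(weights[r][c]) for r, c in group])
        alphas.append(alpha)
        for r, c in group:
            out[r][c] = alpha if weights[r][c] >= 0 else -alpha
    return out, mask, alphas


weights = [[0.1, -0.2, 0.15, 3.0], [-0.05, 0.12, -0.1, 0.08]]
dequantized, mask, alphas = dual_mode_quantize(weights)
```

In this toy matrix only the outlier 3.0 is flagged as salient and keeps its value after 4-bit quantization; the remaining seven weights collapse to ±α within two magnitude groups. Storage then needs one sign bit per non-salient weight, one scale per channel that contains salient weights, and one scalar per group.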

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

ultra-low-bit quantization

scaling overhead

large language models

weight saliency

Innovation

Methods, ideas, or system contributions that make the work stand out.

ultra-low-bit quantization

graph-guided optimization

saliency-aware quantization