Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

📅 2026-04-09
🤖 AI Summary
This work addresses the severe performance degradation of existing additive quantization methods under extreme 2-bit quantization, which stems from poor codebook initialization and hinders recovery during optimization. The study reveals the critical role of initialization in the optimization geometry of compressed models and proposes an output-aware EM initialization method (OA-EM). OA-EM leverages the representational ratio (ρ = N/KM) to analyze the relationship between weight grouping and codebook capacity, and constructs superior initial codebooks using a Hessian-weighted Mahalanobis distance. Combined with PV fine-tuning and beam search, OA-EM substantially improves 2-bit quantization performance on Llama and Qwen model families, establishing a new Pareto frontier in quality-compute efficiency under ultra-low-bit settings.
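As a rough illustration of the representational ratio, the sketch below reads ρ = N/KM as N / (K · M), with N taken to be the number of weight groups in a layer, K the number of entries per codebook, and M the number of additive codebooks. These readings, the function name, and the 4096 x 4096 layer with groups of 8 weights are assumptions made for the example, not values taken from the paper.

```python
def representational_ratio(num_groups: int, codebook_size: int, num_codebooks: int) -> float:
    """Return rho = N / (K * M): weight groups per available codeword (assumed reading)."""
    if min(num_groups, codebook_size, num_codebooks) <= 0:
        raise ValueError("all arguments must be positive")
    return num_groups / (codebook_size * num_codebooks)


# Hypothetical layer: 4096 x 4096 weights split into groups of 8 weights.
num_groups = (4096 * 4096) // 8

# With 8-bit codes (K = 256), M codebooks cost 8 * M bits per group of 8 weights,
# i.e. M bits per weight (codebook storage overhead ignored).
rho_2bit = representational_ratio(num_groups, codebook_size=256, num_codebooks=2)
rho_3bit = representational_ratio(num_groups, codebook_size=256, num_codebooks=3)

print(f"rho at 2 bits per weight: {rho_2bit:.1f}")
print(f"rho at 3 bits per weight: {rho_3bit:.1f}")
```

Under these assumptions ρ is larger in the 2-bit configuration than in the 3-bit one, since the same number of groups must share fewer codewords.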
📝 Abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
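To make the idea of a Hessian-weighted Mahalanobis distance concrete, here is a minimal NumPy sketch of a single hard-assignment step. It assumes the squared distance d(w, c) = (w - c)^T H (w - c), with one symmetric positive semi-definite matrix H shared by all weight groups. The function names and the toy data are hypothetical, and this is not the paper's OA-EM procedure, only the kind of distance it names.

```python
import numpy as np


def hessian_weighted_sq_dist(groups: np.ndarray, codebook: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Squared distance (w - c)^T H (w - c) for every group/codeword pair.

    groups:   (N, g) weight groups
    codebook: (K, g) candidate codewords
    H:        (g, g) symmetric PSD weighting matrix (assumed shared by all groups)
    Returns an (N, K) matrix of distances.
    """
    diff = groups[:, None, :] - codebook[None, :, :]  # (N, K, g) via broadcasting
    return np.einsum("nkg,gh,nkh->nk", diff, H, diff)


def assign(groups: np.ndarray, codebook: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Index of the nearest codeword for each group under the H-weighted distance."""
    return np.argmin(hessian_weighted_sq_dist(groups, codebook, H), axis=1)


# Toy example: one 2-dimensional group and two codewords.
groups = np.array([[1.0, 0.0]])
codebook = np.array([[0.0, 0.0], [1.0, 3.0]])

euclidean_choice = assign(groups, codebook, np.eye(2))
# Errors along the first coordinate are assumed to affect the layer output 100x more.
weighted_choice = assign(groups, codebook, np.diag([100.0, 1.0]))

print("Euclidean assignment:", euclidean_choice)
print("Hessian-weighted assignment:", weighted_choice)
```

In the toy data the two metrics disagree: the plain Euclidean distance prefers the first codeword, while the weighted distance prefers the second because it matches the group exactly along the heavily weighted coordinate.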
Problem

Research questions and friction points this paper is trying to address.

quantization
codebook initialization
large language models
extreme compression
optimization bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

codebook initialisation
extreme quantization
output-aware EM
Hessian-weighted Mahalanobis distance
optimization geometry
Ian W. Kennedy
Department of Computer Science, University of Sheffield, Sheffield, UK
Nafise Sadat Moosavi
The University of Sheffield
Natural Language Processing, Machine Learning