Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Deploying large language models on heterogeneous hardware requires flexible trade-offs between performance and efficiency without retraining. This work proposes Drop-by-Drop, a post-training quantization framework presented as the first to integrate Matryoshka-style supervision with additive codebooks, so that a single model can run at multiple bit widths selected at inference time. The method is grounded in information theory and successive refinement, uses a weighted mean squared error distortion metric, and is compatible with mainstream architectures such as Qwen and LLaMA. Because one checkpoint serves every bit width, storage and memory overhead are reduced, while the reported perplexity and accuracy remain competitive across the evaluated models.

📝 Abstract

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
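The idea that ordered subsets of additive codebooks give progressively better reconstructions can be illustrated with a small NumPy sketch. This is not the paper's implementation. The random codebooks, the greedy residual encoder, the group size, the uniform `importance` weights, and the names `reconstruct`, `greedy_encode`, and `matryoshka_loss` are all assumptions made for illustration. The sketch only shows how a prefix of codebooks yields a lower-precision reconstruction, and how a Matryoshka-style objective could sum a weighted MSE over every prefix.

```python
import numpy as np


def reconstruct(codebooks, codes, num_books):
    """Sum the selected codewords from the first `num_books` codebooks."""
    out = np.zeros((codes.shape[0], codebooks[0].shape[1]))
    for m in range(num_books):
        out += codebooks[m][codes[:, m]]
    return out


def greedy_encode(weights, codebooks):
    """Greedy residual encoding (an assumed, simple encoder):
    each codebook quantizes whatever the previous ones left over."""
    residual = weights.copy()
    codes = np.empty((weights.shape[0], len(codebooks)), dtype=np.int64)
    for m, book in enumerate(codebooks):
        # Squared distance from every residual vector to every codeword.
        dists = ((residual[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        codes[:, m] = dists.argmin(axis=1)
        residual -= book[codes[:, m]]
    return codes


def matryoshka_loss(weights, codebooks, codes, importance, level_weights):
    """Weighted MSE summed over every codebook prefix (1 book, 2 books, ...)."""
    total = 0.0
    for m, lam in enumerate(level_weights, start=1):
        err = weights - reconstruct(codebooks, codes, m)
        total += lam * float((importance * err ** 2).mean())
    return total


rng = np.random.default_rng(0)
group_dim, num_books, book_size = 8, 3, 16  # 16 codewords = 4 index bits per group
weights = rng.standard_normal((256, group_dim))  # stand-in for Gaussian-like weights

# Hypothetical codebooks: random codewords whose scale shrinks at each level.
# Codeword 0 is the zero vector, so adding a codebook can never increase
# the error under greedy encoding.
codebooks = []
for m in range(num_books):
    book = rng.standard_normal((book_size, group_dim)) * 0.5 ** m
    book[0] = 0.0
    codebooks.append(book)

codes = greedy_encode(weights, codebooks)
importance = np.ones(group_dim)  # uniform weighting assumed; a real metric would differ

# Reconstruction error when using 0, 1, 2, ... codebooks.
errors = [
    float(((weights - reconstruct(codebooks, codes, m)) ** 2).mean())
    for m in range(num_books + 1)
]
print("MSE per precision level:", errors)
print("Matryoshka-style loss:",
      matryoshka_loss(weights, codebooks, codes, importance, [1.0] * num_books))
```

In this sketch, serving a lower bit width only requires reading fewer columns of `codes` and fewer codebooks from the same stored arrays, which is the sense in which a single checkpoint can cover several precision levels.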

Problem

Research questions and friction points this paper is trying to address.

multi-bitwidth quantization

large language models

post-training quantization

inference-time precision control

heterogeneous hardware

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-bitwidth quantization

post-training quantization

additive codebooks