🤖 AI Summary
Deploying large language models on heterogeneous hardware requires flexible trade-offs between performance and efficiency without retraining. This work proposes Drop-by-Drop, a multi-bitwidth post-training quantization framework that combines Matryoshka-style supervision with additive codebooks so that a single model can run at several bitwidths, with the precision chosen at inference time. The method is grounded in information theory and successive refinement, and uses a weighted mean squared error distortion metric motivated by LLM loss functions. Because one checkpoint serves every bitwidth, storage and memory overhead are reduced, while perplexity and accuracy are reported to remain competitive across architectures such as Qwen, LLaMA, Gemma, and Mistral.
📝 Abstract
As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. The result is a single model in which ordered subsets of codebooks yield accurate partial reconstructions at each precision level. Because a single checkpoint serves multiple bitwidths, this approach significantly reduces storage and memory overhead while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
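The idea that ordered subsets of additive codebooks give progressively better reconstructions can be illustrated with a small NumPy sketch. This is not the Drop-by-Drop training procedure: it assumes greedy residual k-means in place of Matryoshka-style supervision, plain (unweighted) MSE in place of the weighted distortion, and synthetic Gaussian values in place of real LLM weights. The function names `kmeans`, `fit_residual_codebooks`, and `reconstruct` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, nearest-center index per row)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return centers, dists.argmin(axis=1)


def fit_residual_codebooks(w, num_codebooks=3, k=16):
    """Greedy stand-in for additive codebooks: each codebook is fit to the
    residual left by the ones before it (assumption, not the paper's loss)."""
    residual = w.copy()
    codebooks, codes = [], []
    for m in range(num_codebooks):
        centers, assign = kmeans(residual, k, seed=m)
        codebooks.append(centers)
        codes.append(assign)
        residual = residual - centers[assign]
    return codebooks, codes


def reconstruct(codebooks, codes, m):
    """Sum the selected entries of the first m codebooks (requires m >= 1)."""
    return sum(c[a] for c, a in zip(codebooks[:m], codes[:m]))


rng = np.random.default_rng(0)
w = rng.standard_normal((512, 4))  # toy "weights": 512 groups of 4 values

codebooks, codes = fit_residual_codebooks(w, num_codebooks=3, k=16)

# Each codebook stores a 4-bit index per group of 4 weights, i.e. about
# 1 bit per weight, ignoring the storage cost of the codebooks themselves.
errors = [
    float(np.mean((w - reconstruct(codebooks, codes, m)) ** 2))
    for m in (1, 2, 3)
]
for m, err in zip((1, 2, 3), errors):
    print(f"{m} codebook(s), about {m} bit(s)/weight: MSE = {err:.4f}")
```

Dropping the last codebook at inference time lowers the bitwidth without touching the remaining indices, which is the property that lets one checkpoint serve several precision levels. In this greedy sketch the earlier codebooks are fit without regard to later ones; supervising every prefix jointly, as the abstract describes, is what the Matryoshka-style loss is meant to improve on.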