LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of conventional quantization methods: they are restricted to fixed integer bit-widths, so a large language model often cannot be fitted optimally to a strict memory budget. The authors propose LiftQuant, a framework built on a "lift-then-project" mechanism: it maps points of a high-dimensional 1-bit lattice back into the original weight space via a linear transformation, enabling quasi-continuous bit-width compression of large language models for the first time. Because it is not bound to integer bit-widths, LiftQuant allows fine-grained control over the compression ratio. To fit a 24GB GPU, it compresses a 70B-parameter model to an effective 2.4 bits per weight, and the result significantly outperforms state-of-the-art 2-bit quantization approaches on the same device, which the authors describe as truly Pareto-optimal deployment.
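The memory arithmetic behind the 24GB example can be checked with a short sketch. It counts weight storage only and assumes decimal gigabytes (1 GB = 1e9 bytes); the helper name `weight_memory_gb` is hypothetical, and overheads such as projection matrices, activations, and the KV cache are not modelled.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    budget_gb = 24.0  # GPU memory budget from the summary
    for bits in (2.0, 2.4, 3.0):
        gb = weight_memory_gb(70e9, bits)
        print(f"{bits:.1f} bits/weight: {gb:.2f} GB, fits: {gb <= budget_gb}")
```

Under these assumptions, 70B weights take about 17.5 GB at 2 bits, 21 GB at 2.4 bits, and 26.25 GB at 3 bits. A 3-bit model therefore exceeds the budget, while a 2-bit model leaves several gigabytes unused.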
📝 Abstract
Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2-bit or 3-bit), resulting in a "deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism, which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional "lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, since the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). Unlike VQ, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, which keeps it hardware-friendly. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.
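The lift-then-project idea can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the projection matrix `P` is random here (how LiftQuant obtains it is not stated in the abstract), the brute-force encoder is only feasible for tiny lifted dimensions, and every function name is hypothetical. It shows only that an original dimension d = 5 and a lifted dimension m = 12 give 12/5 = 2.4 bits per weight, and that decoding is a single linear map applied to a ±1 code.

```python
import itertools

import numpy as np


def effective_bits(lifted_dim: int, original_dim: int) -> float:
    """Bits per weight: one bit per lifted coordinate, shared by original_dim weights."""
    return lifted_dim / original_dim


def build_codebook(P: np.ndarray):
    """Project every 1-bit lattice point s in {-1, +1}^m to P @ s (shape: 2^m x d)."""
    m = P.shape[1]
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=m)))
    return signs, signs @ P.T


def encode(w: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Brute-force nearest codeword (illustrative only; exponential in m)."""
    signs, codebook = build_codebook(P)
    idx = np.argmin(np.sum((codebook - w) ** 2, axis=1))
    return signs[idx]


def decode(s: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Decoding is a single linear projection of the 1-bit code."""
    return P @ s


rng = np.random.default_rng(0)
d, m = 5, 12                                  # 12 / 5 = 2.4 bits per weight
P = rng.standard_normal((d, m)) / np.sqrt(m)  # assumption: random projection
w = rng.standard_normal(d)                    # stand-in for a weight sub-vector

s = encode(w, P)
w_hat = decode(s, P)
print("effective bits:", effective_bits(m, d))
print("reconstruction error:", np.linalg.norm(w - w_hat))
```

Changing `m` while holding `d` fixed moves the bit-width in steps of 1/d, which is the sense in which the dimension ratio acts as a quasi-continuous control. The projected codewords are not spaced on a uniform grid, which matches the abstract's description of a structured yet non-uniform codebook.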
Problem

Research questions and friction points this paper is trying to address.

quantization
bit-width
Large Language Models
memory budget
deployment gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous bit-width
dimensional lifting
vector quantization
hardware-friendly quantization
Pareto-optimal deployment
🔎 Similar Papers