IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation of large language models (LLMs) under ultra-low-bit (<4-bit) mixed-precision quantization for resource-constrained edge deployment, this paper proposes the first mixed-precision quantization method explicitly modeling inter-layer interactions. Unlike conventional approaches relying on isolated layer-wise metrics, we introduce a cooperative game-theoretic framework and design Shapley-value-based Progressive Quantization Sensitivity Estimation (SPQE) to accurately characterize both per-layer sensitivity and cross-layer dependencies. We then formulate the optimal 2/4-bit precision allocation problem as a binary quadratic program. Our method is compatible with mainstream post-training quantization (PTQ) backends—including Quanto, HQQ, and GPTQ—and is validated on Llama-3, Gemma-2, and Qwen-3. Across average bit-widths from 4 down to 2 bits, IMPQ reduces perplexity by 20%–80% relative to state-of-the-art baselines, with the margin growing as the bit-width tightens.
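The page does not reproduce the SPQE algorithm itself; as an illustrative sketch only, the Shapley-value idea behind it can be approximated with a generic Monte-Carlo permutation estimator, where a "coalition" is the set of layers quantized to low precision and the value function is the resulting degradation. The function names (`mc_shapley`, `toy_loss`) and all numbers below are hypothetical stand-ins, not the paper's implementation:

```python
import random

def mc_shapley(n_layers, loss_fn, n_samples=200, seed=0):
    """Monte-Carlo Shapley estimate of each layer's contribution to the
    quantization loss. A coalition S is the set of layers held at low
    precision; loss_fn(S) returns the degradation it causes."""
    rng = random.Random(seed)
    phi = [0.0] * n_layers
    for _ in range(n_samples):
        perm = rng.sample(range(n_layers), n_layers)  # random layer order
        coalition = set()
        prev = loss_fn(coalition)
        for layer in perm:
            coalition.add(layer)
            cur = loss_fn(coalition)
            phi[layer] += cur - prev  # marginal contribution of this layer
            prev = cur
    return [p / n_samples for p in phi]

# Toy loss (hypothetical): each low-precision layer costs 0.1, and layers
# 0 and 1 interact badly when both are quantized together.
def toy_loss(S):
    base = 0.1 * len(S)
    return base + (0.5 if {0, 1} <= S else 0.0)

print(mc_shapley(3, toy_loss))
```

By the efficiency property, the estimates sum to the loss of quantizing all layers; the interaction penalty is split between layers 0 and 1, so both score higher than the independent layer 2, which is the kind of cross-layer signal isolated per-layer metrics miss.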

📝 Abstract
Large Language Models (LLMs) promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook critical inter-layer interactions affecting overall performance. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Second, building upon SPQE, we propose Interaction-aware Mixed-Precision Quantization (IMPQ), which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2-bit or 4-bit precision to layers under strict memory constraints. Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bits down to 2 bits, IMPQ cuts perplexity by 20%–80% relative to the best baseline, with the margin growing as the bit-width tightens.
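The binary quadratic optimization the abstract mentions can be made concrete with a toy instance. The sketch below brute-forces the assignment for a handful of layers under a memory budget; the cost coefficients, layer sizes, and the function `allocate_bits` are hypothetical illustrations (the paper's IMPQ would feed Shapley-derived coefficients to a real BQP solver, not enumeration):

```python
from itertools import product

def allocate_bits(phi, inter, sizes, budget_bits):
    """Exhaustive solver for a tiny binary quadratic program:
    x_i = 1 means layer i is quantized to 2 bits (else 4 bits).
    Minimize sum_i phi[i]*x_i + sum_{i<j} inter[i][j]*x_i*x_j
    subject to total weight memory <= budget_bits."""
    n = len(phi)
    best_cost, best_x = None, None
    for x in product((0, 1), repeat=n):
        mem = sum(s * (2 if xi else 4) for s, xi in zip(sizes, x))
        if mem > budget_bits:
            continue  # violates the memory constraint
        cost = sum(phi[i] * x[i] for i in range(n))
        cost += sum(inter[i][j] * x[i] * x[j]
                    for i in range(n) for j in range(i + 1, n))
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_x

# Toy instance (hypothetical numbers): layer 1 is the most sensitive,
# and layers 0 and 2 interact badly when both drop to 2 bits.
phi = [0.2, 0.9, 0.3]
inter = [[0, 0, 0.8], [0, 0, 0], [0, 0, 0]]
sizes = [10, 10, 10]                       # weights per layer
print(allocate_bits(phi, inter, sizes, budget_bits=100))  # -> (1, 0, 0)
```

With this budget at least one layer must drop to 2 bits, and the interaction term steers the solver away from quantizing layers 0 and 2 together, which a purely per-layer metric could not express.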
Problem

Research questions and friction points this paper is trying to address.

Addressing performance degradation in low-bit LLM quantization
Capturing critical inter-layer interactions for mixed-precision assignment
Optimizing 2/4-bit layer allocation under strict memory constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapley-based Progressive Quantization Estimation for sensitivity
Binary quadratic optimization for precision assignment
Interaction-aware mixed-precision quantization under constraints
Junchen Zhao
University of California, Irvine
Ali Derakhshan
University of California, Irvine
Dushyant Bharadwaj
University of California, Irvine
Jayden Kana Hyman
University of California, Irvine
Junhao Dong
University of California, Irvine
Sangeetha Abdu Jyothi
University of California, Irvine
Networking, Computer Systems, Machine Learning
Ian Harris
University of California, Irvine