DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To efficiently execute large language models (LLMs) on-device under dynamic operational constraints such as latency and accuracy, this paper proposes a runtime mechanism for layer-wise dynamic precision allocation. The core insight is the empirical observation that layer-wise sensitivity in LLMs varies significantly across decoding steps. Leveraging this, the authors design a lightweight online error estimator and a differentiable threshold-learning method, integrated with multi-scale quantization, to adaptively determine the bit-width of each linear module at runtime. Unlike conventional static or mixed-precision approaches, the method requires only lightweight fine-tuning of threshold values rather than full model retraining. Evaluated on multiple open-source LLMs (e.g., Phi-3, Qwen2) and benchmarks (MT-Bench, AlpacaEval), it reportedly achieves, on average, +1.8 points in accuracy at fixed latency, or 37% lower on-device latency at equivalent accuracy, significantly improving the accuracy-latency trade-off over state-of-the-art methods.
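The per-layer selection described above can be sketched as follows. This is a minimal illustration only: the class name, the activation-magnitude error proxy, and the concrete bitwidths and threshold values are assumptions for exposition, not the paper's actual implementation.

```python
class PrecisionSelector:
    """Sketch of input-dependent runtime bitwidth selection for one
    linear layer, assuming quantized variants at each bitwidth already
    exist (multi-scale quantization). Names and numbers are illustrative."""

    def __init__(self, bitwidths=(4, 6, 8), thresholds=(0.05, 0.15)):
        self.bitwidths = sorted(bitwidths)      # low -> high precision
        # one learned boundary between each adjacent bitwidth pair
        self.thresholds = list(thresholds)

    def estimate_error(self, activations):
        # cheap input-dependent proxy for quantization error
        # (here: mean absolute activation magnitude)
        return sum(abs(a) for a in activations) / len(activations)

    def select(self, activations):
        err = self.estimate_error(activations)
        # escalate to a higher bitwidth each time err crosses a threshold
        idx = sum(err > t for t in self.thresholds)
        return self.bitwidths[idx]
```

At each decoding step the selector would run before the layer's matmul, so its cost (a reduction over the activations plus a couple of comparisons) must stay negligible relative to the layer itself.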

📝 Abstract
How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question remains open: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
Problem

Research questions and friction points this paper is trying to address.

Dynamic precision assignment for on-device LLM queries
Optimizing the latency-accuracy trade-off via layer-wise bitwidth adaptation
Lightweight runtime precision selection using error estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer-wise precision assignment for LLMs
Lightweight error estimator for runtime bitwidth selection
Fine-tuned threshold values for adaptive performance-latency trade-off
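The thresholds above are learned through fine-tuning. A common way to make a hard threshold comparison trainable (an assumption here, not necessarily the paper's exact objective) is to relax it with a sigmoid so gradients flow through the selection:

```python
import math

def soft_bitwidth(err, thresholds, bitwidths, temperature=0.05):
    """Sigmoid relaxation of hard threshold-based bitwidth selection.

    Illustrative sketch: `err` is the estimated quantization error,
    `thresholds` the trainable boundaries, `bitwidths` the available
    precisions (ascending). Returns the expected bitwidth under the
    relaxed (differentiable) selection."""
    # gate_i in (0, 1): how strongly err exceeds threshold i;
    # as temperature -> 0 this recovers the hard comparison
    gates = [1.0 / (1.0 + math.exp(-(err - t) / temperature))
             for t in thresholds]
    expected = bitwidths[0]
    for g, (lo, hi) in zip(gates, zip(bitwidths, bitwidths[1:])):
        expected += g * (hi - lo)   # soft step up to the next bitwidth
    return expected
```

During fine-tuning one would backpropagate a task loss plus a latency penalty through `expected`, then switch to the hard comparison at inference.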