DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To efficiently execute large language models (LLMs) on-device under dynamic operational constraints such as latency and accuracy, this paper proposes a runtime mechanism for layer-wise dynamic precision allocation. The core insight is the empirical observation that layer-wise sensitivity in LLMs varies significantly across decoding steps. Leveraging this, the authors design a lightweight online error estimator and a differentiable threshold-learning method, integrated with multi-scale quantization, to adaptively determine the bit-width of each linear module at runtime. Unlike conventional static or mixed-precision approaches, the method requires only lightweight fine-tuning of threshold values rather than full model retraining. Evaluated on multiple open-source LLMs (e.g., Phi-3, Qwen2) and benchmarks (MT-Bench, AlpacaEval), it reportedly achieves, on average, +1.8 points in accuracy at fixed latency, or 37% lower on-device latency at equivalent accuracy, significantly improving the accuracy-latency trade-off over state-of-the-art methods.
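The per-layer selection described above can be sketched as follows. This is a minimal illustration only: the class name, the activation-magnitude error proxy, and the concrete bitwidths and threshold values are assumptions for exposition, not the paper's actual implementation.

```python
class PrecisionSelector:
    """Sketch of input-dependent runtime bitwidth selection for one
    linear layer, assuming quantized variants at each bitwidth already
    exist (multi-scale quantization). Names and numbers are illustrative."""

    def __init__(self, bitwidths=(4, 6, 8), thresholds=(0.05, 0.15)):
        self.bitwidths = sorted(bitwidths)      # low -> high precision
        # one learned boundary between each adjacent bitwidth pair
        self.thresholds = list(thresholds)

    def estimate_error(self, activations):
        # cheap input-dependent proxy for quantization error
        # (here: mean absolute activation magnitude)
        return sum(abs(a) for a in activations) / len(activations)

    def select(self, activations):
        err = self.estimate_error(activations)
        # escalate to a higher bitwidth each time err crosses a threshold
        idx = sum(err > t for t in self.thresholds)
        return self.bitwidths[idx]
```

At each decoding step the selector would run before the layer's matmul, so its cost (a reduction over the activations plus a couple of comparisons) must stay negligible relative to the layer itself.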

📝 Abstract
How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question remains open: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
Problem

Research questions and friction points this paper is trying to address.

Dynamic precision assignment for on-device LLM queries
Optimizing the latency-accuracy trade-off via layer-wise bitwidth adaptation
Lightweight runtime precision selection using error estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layer-wise precision assignment for LLMs
Lightweight error estimator for runtime bitwidth selection
Fine-tuned threshold values for adaptive performance-latency trade-off
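The thresholds above are learned through fine-tuning. A common way to make a hard threshold comparison trainable (an assumption here, not necessarily the paper's exact objective) is to relax it with a sigmoid so gradients flow through the selection:

```python
import math

def soft_bitwidth(err, thresholds, bitwidths, temperature=0.05):
    """Sigmoid relaxation of hard threshold-based bitwidth selection.

    Illustrative sketch: `err` is the estimated quantization error,
    `thresholds` the trainable boundaries, `bitwidths` the available
    precisions (ascending). Returns the expected bitwidth under the
    relaxed (differentiable) selection."""
    # gate_i in (0, 1): how strongly err exceeds threshold i;
    # as temperature -> 0 this recovers the hard comparison
    gates = [1.0 / (1.0 + math.exp(-(err - t) / temperature))
             for t in thresholds]
    expected = bitwidths[0]
    for g, (lo, hi) in zip(gates, zip(bitwidths, bitwidths[1:])):
        expected += g * (hi - lo)   # soft step up to the next bitwidth
    return expected
```

During fine-tuning one would backpropagate a task loss plus a latency penalty through `expected`, then switch to the hard comparison at inference.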