Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the accuracy–latency trade-off of large language model (LLM) inference in mobile edge computing, where limited on-device resources and communication and queuing delays at the edge pose significant challenges. The authors propose a dynamic offloading framework based on token-level uncertainty, introducing a boundary-aware token uncertainty metric to guide decision-making. A greedy offloading algorithm (GOA) adaptively determines, for each generated token, whether to execute locally or offload to the edge server, enabling efficient and scalable joint optimization of accuracy and latency in multi-user scenarios. Experiments show that GOA consistently achieves substantial latency reduction across varying user densities while maintaining high inference accuracy with manageable computational overhead, outperforming existing baseline methods.
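The margin-based token-level uncertainty the summary refers to can be illustrated with a minimal sketch. The exact metric in the paper is not reproduced here; this assumes the common formulation where uncertainty is high when the top-2 next-token probabilities are close:

```python
import math

def token_margin_uncertainty(logits):
    """Margin-based uncertainty for one decoding step (illustrative,
    not the paper's exact boundary-aware metric).

    Softmax the logits, then return 1 minus the gap between the two
    most probable tokens: a small gap (near-tied candidates) means
    the model is uncertain about the next token.
    """
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return 1.0 - (probs[0] - probs[1])
```

A confident step such as `token_margin_uncertainty([10.0, 0.0, 0.0])` yields a value near 0, while a flat distribution like `[1.0, 1.0, 1.0]` yields 1.0.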

📝 Abstract
Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing, especially in multi-user environments. In this work, we propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES, based on token-level uncertainty and resource constraints. We define a margin-based token-level uncertainty metric and demonstrate its correlation with model accuracy. Leveraging this metric, we design a greedy offloading algorithm (GOA) that minimizes delay while maintaining accuracy by prioritizing offloading for high-uncertainty queries. Our experiments show that GOA consistently achieves a favorable trade-off, outperforming baseline strategies in both accuracy and latency across varying user densities, and operates with practical computation time. These results establish GOA as a scalable and effective solution for LLM inference in MEC environments.
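The greedy, uncertainty-prioritized offloading described in the abstract can be sketched as follows. The delay model and all parameter names (`local_delay`, `edge_exec_delay`, `comm_delay`, `threshold`) are assumptions for illustration, not the paper's exact GOA formulation:

```python
def greedy_offload(queries, local_delay, edge_exec_delay, comm_delay, threshold):
    """Hypothetical greedy sketch of uncertainty-prioritized offloading.

    Queries with uncertainty above `threshold` are candidates for the
    edge server; candidates are considered in order of decreasing
    uncertainty, and each decision accounts for the queuing delay
    created by queries offloaded before it. A candidate is offloaded
    only while the estimated edge delay still beats local execution.
    Returns a dict mapping query id to 'edge' or 'local'.
    """
    plan = {q["id"]: "local" for q in queries}
    ranked = sorted(queries, key=lambda q: q["uncertainty"], reverse=True)
    queued = 0  # queries already assigned to the edge server
    for q in ranked:
        if q["uncertainty"] < threshold:
            break  # remaining queries are confident enough to stay local
        edge_delay = comm_delay + (queued + 1) * edge_exec_delay
        if edge_delay < local_delay:
            plan[q["id"]] = "edge"
            queued += 1
    return plan
```

With `local_delay=5`, `edge_exec_delay=1`, `comm_delay=1`, and `threshold=0.5`, the two high-uncertainty queries below are offloaded while the confident one runs locally.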
Problem

Research questions and friction points this paper is trying to address.

LLM offloading
accuracy-delay trade-off
mobile edge computing
token-level uncertainty
resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-level uncertainty
LLM offloading
accuracy-delay trade-off
mobile edge computing
greedy offloading algorithm