LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices

📅 2025-12-25
🤖 AI Summary
To address the challenge of lossless collaborative inference of large language models (LLMs) on memory- and bandwidth-constrained edge devices, this paper proposes a lossless distributed inference framework tailored for heterogeneous edge platforms (e.g., NVIDIA Jetson). Our method integrates (1) a novel interleaved pipeline parallelism scheme with dynamic model offloading, and (2) a hybrid scheduling strategy combining fine-grained offline allocation with online memory-adaptive runtime management, jointly optimizing accuracy, latency, and resource efficiency. Evaluated on a cluster of four heterogeneous Jetson devices running LLaMA3.3-70B-Instruct, the framework achieves 3.7× and 1.7× throughput improvements under bursty and sparse request patterns, respectively—while preserving full numerical precision throughout inference. This represents a significant advance in enabling real-time, lossless LLM collaboration at the edge.

📝 Abstract
Large language models (LLMs) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and substantial resource demands pose critical challenges for efficient inference on edge devices. These devices are inherently constrained by limited computational power and memory capacity, while bandwidth bottlenecks at the network edge further restrict distributed deployment and real-time responsiveness. Although existing research has explored lightweight optimization techniques to mitigate memory limitations, such approaches often incur significant degradation in model accuracy and performance. To address these challenges, we propose LIME, a collaborative system that enables lossless inference of large models across multiple memory-constrained edge devices under limited network bandwidth. LIME employs an interleaved pipeline parallelism scheme in conjunction with model offloading to dynamically balance computation and communication. Furthermore, a fine-grained offline allocation scheduler and an online memory adaptation strategy are introduced to better utilize each device's computing and storage resources while minimizing inference latency. Extensive experiments demonstrate that LIME, deployed on four heterogeneous NVIDIA Jetson edge devices for LLaMA3.3-70B-Instruct model inference, achieves 1.7× and 3.7× speedups over state-of-the-art baselines under sporadic and bursty request patterns respectively, without compromising model accuracy.
Problem

Research questions and friction points this paper is trying to address.

Enables lossless LLM inference on memory-constrained edge devices
Addresses computational and memory limits in distributed edge deployments
Minimizes inference latency without sacrificing model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative lossless inference across multiple edge devices
Interleaved pipeline parallelism with dynamic model offloading
Fine-grained offline scheduler and online memory adaptation
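The core ideas listed above can be illustrated with a toy sketch. The code below is not from the paper; all names (`Device`, `run_pipeline`, the FIFO residency policy) are illustrative assumptions. It shows (1) a pipeline schedule in which micro-batches enter successive stages at staggered steps, and (2) simulated offloading, where each device keeps only a memory-budgeted subset of its layers resident and "fetches" the rest on demand. Because every layer is executed at full precision, the pipelined result matches sequential execution exactly, which is the sense in which such a scheme is lossless.

```python
# Toy sketch of interleaved pipeline parallelism with model offloading.
# Illustrative only: names and the FIFO residency policy are assumptions,
# not the paper's actual scheduler.
from collections import deque

def make_layer(w):
    # A "layer" is just an affine map here; real layers are transformer blocks.
    return lambda x: x * w + 1

class Device:
    """Holds a contiguous slice of layers; only `resident` layer indices are
    'in memory'. Offloaded layers are fetched on demand (simulated load)."""
    def __init__(self, layers, mem_budget):
        self.layers = layers                      # layers assigned to this device
        self.resident = deque(maxlen=mem_budget)  # FIFO residency set
        self.loads = 0                            # simulated host->accelerator transfers

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i not in self.resident:            # layer was offloaded: fetch it
                self.loads += 1
                self.resident.append(i)           # may evict the oldest resident layer
            x = layer(x)
        return x

def run_pipeline(devices, micro_batches):
    """Fill-drain pipeline schedule: micro-batch m reaches stage s at step
    m + s, so at any step different stages process different micro-batches."""
    n_stages, n_mb = len(devices), len(micro_batches)
    acts = dict(enumerate(micro_batches))         # mb index -> current activation
    for step in range(n_mb + n_stages - 1):
        # Stages would run in parallel within a step; we emulate that serially.
        for s in range(n_stages):
            m = step - s
            if 0 <= m < n_mb:
                acts[m] = devices[s].forward(acts[m])
    return [acts[m] for m in range(n_mb)]
```

A quick check of the lossless property: splitting four layers across two devices (each with room for only one resident layer) and running three micro-batches through the pipeline yields exactly the same outputs as applying all four layers sequentially, while `Device.loads` records the extra transfers that offloading costs.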
Mingyu Sun
National Grid Electricity System Operator (ESO)
Power system dynamics, PMU

Xiao Zhang
School of Computer Science and Technology, Shandong University, Qingdao 266237, China

Shen Qu
School of Computer Science and Technology, Shandong University, Qingdao 266237, China

Yan Li
School of Computer Science and Technology, Shandong University, Qingdao 266237, China

Mengbai Xiao
Shandong University

Dongxiao Yu
Professor of Computer Science, Shandong University
Distributed Computing, Wireless Networking, Graph Algorithms

Yuan Yuan
School of Artificial Intelligence, Shandong University, Jinan 250100, China