Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency and potential accuracy degradation of large language models (LLMs) under resource-constrained mobile edge computing (MEC) scenarios, this paper proposes a resource-aware parallel speculative decoding framework. The framework pioneers the integration of parallel speculative decoding—where a lightweight draft model collaborates with a target LLM to jointly generate tokens—into the MEC architecture. It employs multi-agent deep reinforcement learning to jointly optimize user association and heterogeneous resource allocation across edge servers and end devices, thereby mitigating both communication overhead and asynchronous execution delays. Evaluated in the Sionna simulator, the proposed method achieves up to 28.0% and an average of 23.7% reduction in end-to-end inference latency while preserving original LLM accuracy. This significantly enhances the scalability and real-time responsiveness of LLM inference services in MEC environments.

📝 Abstract
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
Problem

Research questions and friction points this paper is trying to address.

Optimizing user association and resource allocation for parallel speculative decoding
Reducing communication overhead in mobile edge computing systems
Minimizing latency while maintaining large language model inference accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes user association and resource allocation jointly
Uses multi-agent deep reinforcement learning algorithm
Enables parallel speculative decoding for latency reduction
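To make the joint user association and resource allocation (UARA) objective concrete, here is a deliberately tiny toy model: each user attaches to one edge server, and per-user latency is communication delay plus a compute term that grows with server load. All numbers, user/server names, and the brute-force search are illustrative assumptions; the paper solves the real problem with multi-agent deep reinforcement learning over far richer state.

```python
# Toy UARA illustration: total latency depends jointly on which server
# each user associates with AND on how server compute is shared.
# All values are made up; brute force stands in for the paper's MADRL.
from itertools import product

COMM = {                         # hypothetical uplink delay in ms
    ("u1", "s1"): 4, ("u1", "s2"): 9,
    ("u2", "s1"): 8, ("u2", "s2"): 3,
    ("u3", "s1"): 6, ("u3", "s2"): 5,
}
COMPUTE = {"s1": 10, "s2": 12}   # hypothetical per-round verify time in ms

def total_latency(assoc):
    """Sum per-user latency; compute time scales with server load."""
    load = {s: list(assoc.values()).count(s) for s in COMPUTE}
    return sum(COMM[(u, s)] + COMPUTE[s] * load[s] for u, s in assoc.items())

users, servers = ["u1", "u2", "u3"], ["s1", "s2"]
best = min(
    (dict(zip(users, choice)) for choice in product(servers, repeat=len(users))),
    key=total_latency,
)
print(best, total_latency(best))
# → {'u1': 's1', 'u2': 's2', 'u3': 's1'} 65
```

The point of the toy is the coupling: moving one user to a lightly loaded server changes the compute term for everyone already there, which is why association and resource allocation cannot be optimized independently.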
Jungyeon Koh
Department of Electrical Engineering, POSTECH, Pohang, Republic of Korea
Hyun Jong Yang
Dept. of Electrical & Computer Engineering, Seoul National University
Communications · Signal Processing · Machine Learning