E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying large language models (LLMs) in resource-constrained edge and fog computing environments, where no individual device can host an entire model and naive deployment leads to high latency and inefficient resource utilization. The authors propose a role-based replication mechanism that clusters devices into multiple model replicas and assigns each replica a PREFILL or DECODER role according to how efficiently it handles input or output tokens. This exploits the distinct computational characteristics of the two inference phases and departs from the conventional approach of splitting a single model uniformly across all devices. A genetic algorithm forms the device clusters, and dynamic programming chooses the model partitioning within each cluster to minimize bottlenecks in model-parallel execution, which helps the system adapt to heterogeneous hardware. Experimental results show that under high workload the method reduces average request waiting time by over 50% compared to the Splitwise baseline and that it handles workloads with highly variable input and output lengths.
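To make the clustering step concrete, here is a minimal sketch of a genetic algorithm that assigns devices to replicas. It is not the authors' implementation: the chromosome encoding (one replica index per device), the fitness function (compute of the weakest replica, with a penalty when a replica cannot hold a full model copy), and every name and parameter below are assumptions chosen for illustration.

```python
import random


def fitness(assign, devices, n_replicas, model_mem):
    """Hypothetical objective: compute of the weakest replica.

    Returns -1.0 if any replica lacks the memory for one full model copy.
    `devices` is a list of (memory, compute) tuples; `assign[i]` is the
    replica index of device i.
    """
    mem = [0.0] * n_replicas
    comp = [0.0] * n_replicas
    for replica, (m, c) in zip(assign, devices):
        mem[replica] += m
        comp[replica] += c
    if min(mem) < model_mem:
        return -1.0
    return min(comp)


def cluster_devices(devices, n_replicas, model_mem, generations=200,
                    pop_size=30, mutation_rate=0.1, seed=0):
    """Toy GA; assumes len(devices) >= 2 and pop_size >= 4."""
    rng = random.Random(seed)
    n = len(devices)

    def score(assign):
        return fitness(assign, devices, n_replicas, model_mem)

    # Start from a round-robin assignment plus random assignments.
    pop = [[i % n_replicas for i in range(n)]]
    pop += [[rng.randrange(n_replicas) for _ in range(n)]
            for _ in range(pop_size - 1)]

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]  # elitism: best half kept unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            point = rng.randrange(1, n)  # one-point crossover
            child = a[:point] + b[point:]
            # Mutation: occasionally move a device to a random replica.
            child = [rng.randrange(n_replicas)
                     if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        pop = survivors + children

    return max(pop, key=score)


# Illustrative devices as (memory in GB, relative compute); values are made up.
devices = [(8, 4), (8, 4), (4, 1), (4, 1), (16, 10), (2, 1)]
best = cluster_devices(devices, n_replicas=2, model_mem=14.0)
print(best, fitness(best, devices, 2, 14.0))
```

Because the best half of each generation survives unchanged, the returned assignment is never worse than the round-robin starting point under this toy fitness. A real objective would presumably also account for network links and for the PREFILL or DECODER role of each replica.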

📝 Abstract

Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource-limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role, PREFILL or DECODER, based on its efficiency in handling input and output tokens. This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.
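The partitioning step within a cluster can be illustrated with a classic dynamic program that splits contiguous layers across an ordered list of devices so that the slowest pipeline stage is as fast as possible. This is a sketch under stated assumptions, not the paper's formulation: the cost model (stage time equals summed layer cost divided by device speed), the fixed device order, and the names `partition_layers`, `layer_costs`, and `device_speeds` are all hypothetical, and communication cost and memory limits are ignored.

```python
from itertools import accumulate


def partition_layers(layer_costs, device_speeds):
    """Return (bottleneck, stages) for a contiguous layer split.

    stages[d] is a half-open range (start, end) of layer indices run on
    device d; a device may receive no layers (start == end).
    """
    n, k = len(layer_costs), len(device_speeds)
    prefix = [0.0] + list(accumulate(layer_costs))  # prefix[i] = cost of layers [0, i)
    inf = float("inf")

    # best[d][i]: minimal bottleneck when the first i layers use the first d devices.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for d in range(1, k + 1):
        for i in range(n + 1):
            for j in range(i + 1):  # device d-1 runs layers [j, i)
                stage = (prefix[i] - prefix[j]) / device_speeds[d - 1]
                candidate = max(best[d - 1][j], stage)
                if candidate < best[d][i]:
                    best[d][i] = candidate
                    cut[d][i] = j

    # Walk the recorded cuts backwards to recover each device's layer range.
    stages = []
    i = n
    for d in range(k, 0, -1):
        j = cut[d][i]
        stages.append((j, i))
        i = j
    stages.reverse()
    return best[k][n], stages


# Four equal layers, second device three times faster than the first.
print(partition_layers([4, 4, 4, 4], [1, 3]))  # (4.0, [(0, 1), (1, 4)])
```

In the example, the slower device receives one layer and the faster device three, so both stages take 4 time units. The triple loop costs O(k·n²), which is small for typical layer and device counts.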

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Edge Computing

Fog Computing

Model Deployment

Resource Constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

model parallelism

heterogeneous edge computing

LLM inference optimization