🤖 AI Summary
To address high end-to-end latency and low resource utilization in LLM inference across heterogeneous edge-fog-cloud infrastructures, this paper proposes Hyperion, a two-stage hierarchical optimization framework. In the offline stage, a lightweight cross-tier model partition is computed via Binary Search with Dynamic Programming (BSDP). In the online stage, an Adaptive Real-time Task Scheduling (ARTS) algorithm dynamically routes inference requests based on real-time queue states and estimated effective computational capacity. Crucially, the framework decouples slow-timescale model partitioning from fast-timescale task scheduling, enabling global optimization while preserving runtime adaptability, with no model retraining and negligible runtime overhead. Experiments with Phi-3-medium demonstrate end-to-end latency reductions of up to 52.1% versus GPipe and 31.2% versus HEFT; even for long-sequence generation, latency remains 44.5% lower than GPipe. GPU utilization also improves significantly, confirming both efficiency and scalability.
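The offline BSDP stage described above can be sketched as a binary search over the bottleneck stage time, with a feasibility check that packs contiguous layer groups into tiers under memory constraints. This is an illustrative sketch only: the per-layer times, per-tier memory caps, and the greedy packing check (standing in for the paper's dynamic-programming feasibility test, and assuming uniform per-layer compute times across tiers) are assumptions, not Hyperion's actual implementation.

```python
def feasible(layer_time, layer_mem, tier_mem_caps, target):
    """Greedy check: can contiguous layer groups be assigned to tiers
    (in order) so every stage's compute time <= target and its memory
    footprint fits the tier's capacity?"""
    i = 0
    for mem_cap in tier_mem_caps:  # tiers ordered edge -> fog -> cloud
        t = m = 0.0
        # pack layers into this stage while both budgets still hold
        while (i < len(layer_time)
               and t + layer_time[i] <= target
               and m + layer_mem[i] <= mem_cap):
            t += layer_time[i]
            m += layer_mem[i]
            i += 1
    return i == len(layer_time)  # all layers placed?

def bsdp_partition(layer_time, layer_mem, tier_mem_caps, eps=1e-3):
    """Binary-search the smallest bottleneck stage time that still
    admits a feasible contiguous partition across the tiers."""
    lo, hi = max(layer_time), sum(layer_time)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(layer_time, layer_mem, tier_mem_caps, mid):
            hi = mid  # feasible: try a tighter bottleneck
        else:
            lo = mid  # infeasible: relax the bottleneck
    return hi
```

Because feasibility is monotone in the target (a larger bottleneck budget never breaks a feasible packing), the binary search converges to the minimal balanced stage time in O(log(1/eps)) feasibility checks.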
📝 Abstract
Large Language Models (LLMs) are increasingly executed across edge, fog, and cloud tiers, where limited GPU memory, heterogeneous compute, and variable inter-tier bandwidth jointly constrain deployment and motivate model partitioning and request scheduling. In this setting, low end-to-end latency is governed not only by where a model is deployed (inter-tier model partitioning) but also by how incoming requests are scheduled (intra-tier task scheduling) across heterogeneous nodes. These two problems are tightly coupled: a suboptimal scheduler can negate the benefits of a good partition, and vice versa. In this paper, we propose Hyperion, a hierarchical two-stage framework that jointly optimizes partitioning and scheduling to minimize end-to-end latency for pipelined LLM inference in multi-tier networks, balancing compute and memory across tiers while introducing negligible runtime overhead and requiring no model retraining. Motivated by the observation that partition choices evolve on slower timescales than request arrivals, Stage 1 performs offline, inter-tier partitioning via a Binary Search with Dynamic Programming (BSDP) procedure to produce balanced stage times under tier capacity and memory constraints; to adapt to time-varying load, Stage 2 performs online, intra-tier scheduling with a lightweight Adaptive Real-time Task Scheduling (ARTS) algorithm that maps each request to the best available node using real-time estimates of queue length and effective capacity. Experimental results on multi-tier inference tasks demonstrate that Hyperion reduces end-to-end latency by up to 52.1% and 31.2% on the Phi-3-medium model compared to the GPipe and HEFT baselines, respectively. Furthermore, Hyperion shows superior scalability in long-sequence generation, maintaining a 44.5% lower latency than GPipe and achieving higher GPU utilization.
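The online ARTS stage, which maps each request to the node with the best real-time estimate of queue delay and effective capacity, might be sketched as follows. The node fields (`peak_flops`, `utilization`, `queued_work`) and the completion-time estimate are hypothetical stand-ins for the paper's actual state signals.

```python
def arts_select(nodes, request_work):
    """Dispatch a request to the node minimizing its estimated
    completion time: (queued work + this request) / effective capacity,
    where effective capacity discounts nominal FLOPs by utilization."""
    best, best_eta = None, float("inf")
    for node in nodes:
        cap = node["peak_flops"] * (1.0 - node["utilization"])
        if cap <= 0:
            continue  # node is saturated; skip it
        eta = (node["queued_work"] + request_work) / cap
        if eta < best_eta:
            best, best_eta = node, eta
    if best is None:
        return None  # no node has spare capacity
    # account for the newly dispatched work on the chosen node
    best["queued_work"] += request_work
    return best["name"]
```

Updating `queued_work` on dispatch keeps per-request cost at O(number of nodes), which matches the lightweight, fast-timescale role the abstract assigns to ARTS.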