Large Language Model Partitioning for Low-Latency Inference at the Edge

📅 2025-05-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
On resource-constrained edge devices, autoregressive LLM inference runs into memory overload and high latency because the KV cache grows with every generated token. This paper proposes a resource-aware partitioning algorithm that splits the decoder at attention-head granularity, co-locates each head with its KV cache, and re-evaluates the placement at regular intervals during generation, migrating blocks across devices as resource availability and link bandwidth change. Assigning heads to different devices lets them execute in parallel, which avoids the memory overload and latency bottlenecks of conventional layer-wise partitioning. In small-scale experiments (3-5 devices), the method comes within 15-20% of an exact optimal solver's latency; in larger-scale tests, it improves inference speed and memory usage over state-of-the-art layer-based partitioning approaches.

📝 Abstract
Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver's latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM inference latency in edge environments
Optimizing dynamic partitioning for resource-constrained devices
Managing memory overload from expanding key-value caches
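
To make the memory pressure concrete, the sketch below estimates KV cache size as the sequence grows. It assumes standard multi-head attention (one key and one value vector per head, per layer, per token), batch size 1, and 16-bit values. The model dimensions used in the example (32 layers, 32 heads, head dimension 128) are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes_per_head(seq_len: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes held by one attention head's KV cache in one layer.

    The factor 2 accounts for storing both a key and a value vector per token.
    """
    return 2 * seq_len * head_dim * bytes_per_value


def kv_cache_bytes(seq_len: int, num_layers: int, num_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Total KV cache size across all layers and heads (batch size 1 assumed)."""
    per_head = kv_cache_bytes_per_head(seq_len, head_dim, bytes_per_value)
    return num_layers * num_heads * per_head


# Hypothetical model: 32 layers, 32 heads, head dimension 128, fp16 values.
for tokens in (256, 1024, 4096):
    mib = kv_cache_bytes(tokens, 32, 32, 128) / 2**20
    print(f"{tokens:5d} tokens -> {mib:.0f} MiB of KV cache")
```

The cache grows linearly with the number of generated tokens, so a placement that fits at the start of generation can exceed a device's memory later on. Because each head's share of the cache can be computed on its own, the head is a natural unit to move between devices.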
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic attention head partitioning for edge LLMs
Resource-aware algorithm updates partitioning periodically
Co-locates attention heads with key-value caches
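
The placement-then-migration idea can be illustrated with a small greedy heuristic. This is a minimal sketch under stated assumptions, not the paper's algorithm: the `Device` and `Head` classes, the `place_heads` function, the single shared `bandwidth` value, and the cost model (queued compute time plus KV cache transfer time) are all hypothetical simplifications. Each call uses only the resource snapshot it is given, which mirrors the myopic, interval-based updates described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    memory: float  # bytes the device can currently spare for KV caches
    speed: float   # work units per second it can currently deliver


@dataclass
class Head:
    name: str
    kv_bytes: float  # current size of this head's KV cache
    work: float      # compute work per decoding step


def place_heads(heads, devices, bandwidth, current=None):
    """Greedily assign each head (with its KV cache) to one device.

    current maps head name -> device name from the previous interval;
    pass None on the first run, when nothing needs to migrate.
    bandwidth is in bytes per second and assumed equal on every link.
    """
    current = current or {}
    mem_left = {d.name: d.memory for d in devices}
    busy = {d.name: 0.0 for d in devices}  # compute time already queued
    placement = {}
    # Place the largest caches first, since they are the hardest to fit.
    for head in sorted(heads, key=lambda h: h.kv_bytes, reverse=True):
        best_cost, best_dev = None, None
        for dev in devices:
            if mem_left[dev.name] < head.kv_bytes:
                continue  # the cache would not fit on this device
            previous = current.get(head.name)
            moved = previous is not None and previous != dev.name
            migration = head.kv_bytes / bandwidth if moved else 0.0
            cost = busy[dev.name] + head.work / dev.speed + migration
            if best_cost is None or cost < best_cost:
                best_cost, best_dev = cost, dev
        if best_dev is None:
            raise RuntimeError(f"no device can hold head {head.name}")
        placement[head.name] = best_dev.name
        mem_left[best_dev.name] -= head.kv_bytes
        busy[best_dev.name] += head.work / best_dev.speed
    return placement


# First run: initial placement spreads the heads so they execute in parallel.
devices = [Device("phone", 100.0, 1.0), Device("laptop", 100.0, 1.0)]
heads = [Head("h0", 10.0, 1.0), Head("h1", 10.0, 1.0)]
first = place_heads(heads, devices, bandwidth=100.0)

# Later run: the phone has run short of memory, so its head migrates.
devices[0].memory = 5.0
later = place_heads(heads, devices, bandwidth=100.0, current=first)
```

The cost term charges a migration delay only when a head changes device, so a head stays where it is unless moving it pays off through shorter compute time or is forced by a memory shortfall.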