AI Summary
On resource-constrained edge devices, autoregressive LLM inference suffers from memory overflow and high latency due to the unbounded growth of the KV cache. This paper proposes a resource-aware dynamic Transformer sharding framework: it is the first to enable decoder sharding at attention-head granularity with coordinated KV cache placement, and it introduces a lightweight online scheduler supporting runtime cross-device migration to adapt to resource fluctuations. By jointly optimizing cache locality and computational parallelism, the approach avoids the rigid bottlenecks inherent in conventional layer-wise partitioning. Experiments demonstrate that in small-scale deployments (3-5 devices) the method stays within 15-20% of an exact optimal solver's latency, while in large-scale settings it notably improves inference speed and lowers peak memory consumption compared to state-of-the-art layer-splitting methods.
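To make head-granularity sharding with co-located KV caches concrete, here is a minimal, self-contained sketch. It is our illustration under assumed shapes and names (`HeadShard`, `split_heads`, and the toy dimensions are all hypothetical), not code from the paper: each head owns its projection slices and its own KV cache, which together form the unit a scheduler would place or migrate as one piece.

```python
import numpy as np

def split_heads(W, n_heads):
    # Slice a (d_model, d_model) projection matrix column-wise into per-head blocks.
    d_head = W.shape[1] // n_heads
    return [W[:, h * d_head:(h + 1) * d_head] for h in range(n_heads)]

class HeadShard:
    """One attention head plus its private KV cache, the unit placed on one device."""
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.k_cache, self.v_cache = [], []    # grow by one entry per generated token

    def step(self, x):
        # x: (d_model,) hidden state of the newest token.
        q = x @ self.Wq                        # (d_head,)
        self.k_cache.append(x @ self.Wk)       # the cache lives with its head
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)             # (t, d_head), t = tokens so far
        V = np.stack(self.v_cache)
        scores = K @ q / np.sqrt(q.shape[0])   # scaled dot-product attention
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                     # softmax over past tokens
        return attn @ V                        # (d_head,) head output

# Demo: 4 heads, each of which could live on a different device.
rng = np.random.default_rng(0)
d_model, n_heads = 32, 4
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
shards = [HeadShard(q_, k_, v_) for q_, k_, v_ in zip(
    split_heads(Wq, n_heads), split_heads(Wk, n_heads), split_heads(Wv, n_heads))]
for _ in range(5):                             # five decoding steps
    x = rng.standard_normal(d_model)
    out = np.concatenate([s.step(x) for s in shards])  # heads are independent: parallelizable
print(out.shape)                               # (32,)
```

Because each `HeadShard` touches only its own weight slices and cache, heads can run on different devices in parallel and be migrated independently, which is the property the framework exploits.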
Abstract
Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token is a discrete unit of text. Each newly produced token is appended to the partial output sequence, so the sequence length grows and with it the memory and compute load, driven by the expanding key-value caches that store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layers. Because this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm whose partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it relies on instantaneous information about device resource availability and network link bandwidths. On its first execution, the algorithm places blocks on devices; in subsequent executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention-head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit the parallel execution of attention heads and thus achieve substantial reductions in inference delay. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves latency within 15 to 20 percent of an exact optimal solver, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
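The abstract describes a myopic scheduler that re-plans at regular intervals from instantaneous snapshots of device resources and link bandwidths. The greedy sketch below is our own illustration of that idea (the function `plan` and all cost inputs are hypothetical, and the paper's actual algorithm may differ): for each head, it compares staying put against migrating, charging a migration its KV-cache transfer time over the link plus the destination device's inference time.

```python
def plan(heads, devices, assign, bw, t_infer, cache_bytes):
    """One myopic re-planning round over an instantaneous snapshot.

    heads:       iterable of head ids
    devices:     {dev: free_memory_bytes} snapshot (mutated as heads move)
    assign:      {head: dev} current placement
    bw:          {(src, dst): bytes_per_second} link bandwidths
    t_infer:     {dev: seconds_per_head_per_token} compute estimate
    cache_bytes: {head: bytes} current KV-cache size of each head
    Returns a new placement chosen to keep migration delay + inference delay low.
    """
    new_assign = dict(assign)
    for h in heads:
        src = new_assign[h]
        best_dev, best_cost = src, t_infer[src]          # staying incurs no migration
        for dst in devices:
            if dst == src or devices[dst] < cache_bytes[h]:
                continue                                  # no room for the head's KV cache
            cost = cache_bytes[h] / bw[(src, dst)] + t_infer[dst]
            if cost < best_cost:
                best_dev, best_cost = dst, cost
        if best_dev != src:                               # migrate head h with its cache
            devices[src] += cache_bytes[h]
            devices[best_dev] -= cache_bytes[h]
            new_assign[h] = best_dev
    return new_assign

# Toy snapshot: device "a" is nearly full, so both heads move to the roomier, faster "b".
placement = plan(
    heads=[0, 1],
    devices={"a": 1_000, "b": 50_000},
    assign={0: "a", 1: "a"},
    bw={("a", "b"): 1e6, ("b", "a"): 1e6},
    t_infer={"a": 0.05, "b": 0.01},
    cache_bytes={0: 20_000, 1: 20_000},
)
print(placement)  # {0: 'b', 1: 'b'}
```

A greedy pass like this trades optimality for speed; the paper's small-scale comparison against an exact optimal solver quantifies precisely that gap.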