gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

📅 2025-04-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the severe pipeline bubbles in distributed LLM inference that arise from computational imbalance between prefill and decode under pipeline parallelism, this paper proposes gLLM, a globally load-balanced inference system. Methodologically, it introduces (1) Token Throttling, a fine-grained, cross-batch scheduling mechanism that independently regulates prefill and decode token counts using global system information; (2) a dual-dimension adaptive prefill batch sizing strategy driven by global KV cache utilization and the number of pending tokens; (3) near-constant per-batch token counts during decode; and (4) an asynchronous, message-driven lightweight pipeline runtime. Evaluated on mainstream LLMs, gLLM achieves 11%-398% higher maximum throughput and significantly lower end-to-end latency than state-of-the-art systems such as Sarathi-Serve.
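The throttling policy described above lends itself to a compact illustration. The following is a minimal Python sketch of how such a per-iteration scheduling decision could be expressed; the names used here (TokenThrottler, SystemState, decode_token_target, kv_cache_utilization) are illustrative assumptions, not the actual gLLM API.

```python
# Illustrative sketch of a token-throttling scheduler (all names are
# assumptions, not the actual gLLM implementation).
from dataclasses import dataclass


@dataclass
class SystemState:
    pending_prefill_tokens: int   # tokens of queued requests awaiting prefill
    kv_cache_utilization: float   # fraction of KV cache memory in use (0.0-1.0)
    running_decode_tokens: int    # one token per in-flight decoding request


class TokenThrottler:
    def __init__(self, decode_token_target: int, max_prefill_tokens: int):
        self.decode_token_target = decode_token_target  # near-constant decode budget per batch
        self.max_prefill_tokens = max_prefill_tokens    # upper bound on prefill chunk size

    def decode_budget(self, state: SystemState) -> int:
        # Keep the number of decode tokens per batch near-constant by capping
        # how many in-flight requests are stepped in this iteration.
        return min(state.running_decode_tokens, self.decode_token_target)

    def prefill_budget(self, state: SystemState) -> int:
        # Dual-dimension adaptation: shrink the prefill chunk as the KV cache
        # fills up, and never schedule more than is actually pending.
        memory_headroom = max(0.0, 1.0 - state.kv_cache_utilization)
        budget = int(self.max_prefill_tokens * memory_headroom)
        return min(budget, state.pending_prefill_tokens)


# Example: a mostly full KV cache throttles prefill while decode stays steady.
throttler = TokenThrottler(decode_token_target=256, max_prefill_tokens=2048)
state = SystemState(pending_prefill_tokens=4096,
                    kv_cache_utilization=0.85,
                    running_decode_tokens=300)
print(throttler.decode_budget(state), throttler.prefill_budget(state))  # 256 307
```

The point of the sketch is that both budgets are derived from global system state rather than a fixed per-batch token budget, which is what allows per-batch work to stay roughly even across pipeline stages.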

πŸ“ Abstract
Pipeline parallelism has emerged as a predominant approach for deploying large language models (LLMs) across distributed nodes, owing to its lower communication overhead compared to tensor parallelism. While it demonstrates high throughput in request serving, pipeline parallelism often suffers from performance limitations caused by pipeline bubbles, which primarily result from imbalanced computation delays across batches. Existing methods such as Sarathi-Serve attempt to address this through hybrid scheduling of chunked prefill and decode tokens under a fixed token budget. However, such methods may still exhibit significant fluctuations in per-batch computation time, due to either insufficient prefill tokens or an uneven distribution of decode tokens, ultimately leading to computational imbalance. To overcome these inefficiencies, we present gLLM, a globally balanced pipeline parallelism system that incorporates Token Throttling to effectively mitigate pipeline bubbles. Token Throttling is a fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, enabling balanced computation by leveraging global information from the inference system. Specifically, for decode tokens, gLLM maintains a near-consistent token count across processing batches. For prefill tokens, it dynamically adjusts batch sizes based on both the total number of pending tokens and the memory utilization of the key-value (KV) cache. Furthermore, the gLLM runtime adopts an asynchronous execution and message-passing architecture specifically optimized for the characteristics of pipeline parallelism. Experimental evaluations with representative LLMs show that gLLM achieves significant performance improvements, delivering 11% to 398% higher maximum throughput than state-of-the-art pipeline- or tensor-parallelism systems while simultaneously maintaining lower latency.
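To make the runtime design concrete, the asyncio sketch below shows what an asynchronous, message-driven pipeline stage loop can look like; the queue-based transport and every name in it are assumptions standing in for the real inter-GPU communication layer, not the gLLM codebase.

```python
# Illustrative asyncio sketch of an asynchronous, message-driven pipeline stage
# (queues stand in for the inter-GPU transport a real system would use; names
# are assumptions, not the gLLM runtime API).
import asyncio


async def pipeline_stage(stage_id: int,
                         inbox: asyncio.Queue,
                         outbox: asyncio.Queue,
                         run_model_shard):
    """Each stage loops independently: receive a micro-batch, run its model
    shard, and forward the activations. No stage waits on a global barrier."""
    while True:
        msg = await inbox.get()          # block only on the next message
        if msg is None:                  # shutdown sentinel
            await outbox.put(None)
            break
        batch_id, activations = msg
        output = await run_model_shard(stage_id, activations)
        await outbox.put((batch_id, output))


async def main():
    async def fake_shard(stage_id, x):
        await asyncio.sleep(0.01)        # stand-in for GPU compute
        return x + [stage_id]

    q0, q1, q2 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    stages = [asyncio.create_task(pipeline_stage(0, q0, q1, fake_shard)),
              asyncio.create_task(pipeline_stage(1, q1, q2, fake_shard))]

    for batch_id in range(3):            # inject micro-batches back to back
        await q0.put((batch_id, [batch_id]))
    await q0.put(None)

    while (result := await q2.get()) is not None:
        print("finished batch", result)
    await asyncio.gather(*stages)


asyncio.run(main())
```

Because each stage blocks only on its own inbox, bubbles appear only when a stage's queue runs dry, which is the situation the token-throttling policy above is designed to avoid.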
Problem

Research questions and friction points this paper is trying to address.

Mitigate pipeline bubbles in distributed LLM serving
Balance prefill and decode tokens globally
Optimize throughput and latency in pipeline parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Throttling balances prefill and decode tokens
Dynamic batch sizing based on KV cache usage
Asynchronous execution optimized for pipeline parallelism
Tianyu Guo
Sun Yat-sen University, Guangzhou, China
Xianwei Zhang
Sun Yat-sen University; AMD Research/RTG
Architecture/System, Compilation, GPU/Memory, HPC, Simulation/Modeling
Jiangsu Du
Sun Yat-sen University, Guangzhou, China
Zhiguang Chen
Sun Yat-sen University, Guangzhou, China
Nong Xiao
Sun Yat-sen University, Guangzhou, China
Yutong Lu
Sun Yat-sen University, Guangzhou, China