🤖 AI Summary
This work addresses the limited scalability of tensor parallelism in online inference for large models, where non-scalable overheads prevent cluster performance from scaling near-linearly. The authors propose Albireo, a system that reduces these bottlenecks without modifying the model architecture by overlapping scheduling with computation, employing sequence-parallel sampling, and optimizing KV cache management. Albireo also introduces the concept of an “empirically optimal tensor parallelism degree” to guide the choice of parallelism strategy. Experiments show that, compared to vLLM, Albireo achieves up to 1.9× higher throughput, 48% lower latency, 28% higher GPU utilization, and 54% lower energy consumption, with up to 2× higher throughput in production environments.
📝 Abstract
Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models but scales sub-linearly as the TP degree t grows, due to cross-GPU communication and non-scalable runtime work, as predicted by Amdahl's Law. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping. We identify and validate an empirical optimal TP degree t_e that balances these effects. We present Albireo, a parallel inference system that raises the attainable t_e by shrinking the non-scalable portion, both by overlapping scheduling and I/O with compute and by using sequence-parallel sampling, without changing model architectures. Across models and benchmarks, Albireo achieves up to 1.9x higher throughput, 48% lower latency, 28% higher GPU utilization, and 54% lower energy consumption than vLLM; in production it yields up to 2x higher throughput.
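
The Amdahl's Law argument in the abstract can be illustrated with a small sketch. It uses the standard formula, speedup = 1 / (s + (1 - s) / t), where s is the fraction of per-step work that does not scale with the TP degree t. The function names and the values of s below are illustrative assumptions, not measurements from the paper, and the sketch models only the sub-linear scaling side of the trade-off, not the memory and KV-cache benefits of a larger t.

```python
def amdahl_speedup(serial_fraction: float, t: int) -> float:
    """Ideal speedup at TP degree t when a fixed fraction of work does not scale.

    serial_fraction is a hypothetical stand-in for the non-scalable portion
    (for example scheduling, I/O, or sampling); it is not a value from the paper.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if t < 1:
        raise ValueError("t must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / t)


def per_gpu_efficiency(serial_fraction: float, t: int) -> float:
    """Speedup divided by the number of GPUs used (1.0 means linear scaling)."""
    return amdahl_speedup(serial_fraction, t) / t


if __name__ == "__main__":
    # Compare a larger and a smaller non-scalable fraction (illustrative values).
    for s in (0.20, 0.05):
        for t in (1, 2, 4, 8):
            print(f"s={s:.2f} t={t}: speedup={amdahl_speedup(s, t):.2f}, "
                  f"efficiency={per_gpu_efficiency(s, t):.2f}")
```

Under these assumed values, a non-scalable fraction of 0.20 caps the speedup at t = 8 at roughly 3.3x, while shrinking it to 0.05 raises that to roughly 5.9x. This is the sense in which reducing the non-scalable portion makes a higher TP degree worthwhile.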