Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
The inherent mismatch between the compute-intensive prefill and memory-bound decode phases in LLM serving leads to suboptimal GPU utilization, while existing hybrid batching approaches settle for inefficient latency-throughput trade-offs. This paper proposes Bullet, a dynamic spatiotemporal co-scheduling framework. It introduces a spatial-temporal joint orchestration mechanism that enables concurrent execution of prefill and decode; designs an SLO-aware dynamic resource provisioning model that decouples latency constraints from throughput optimization; and integrates real-time performance modeling, attention bottleneck mitigation, wave quantization optimization, and adaptive GPU compute/memory allocation. Evaluated under realistic workloads, the framework achieves a 1.26× average throughput improvement (up to 1.55× peak) while strictly satisfying end-to-end latency SLOs.
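To make the wave quantization bottleneck mentioned above concrete: a GPU launches thread blocks in "waves" of roughly one block per SM, so a grid that slightly overflows a wave pays for a nearly empty extra wave. The toy calculation below is illustrative, not taken from the paper; `wave_efficiency` and its arguments are hypothetical names.

```python
import math

def wave_efficiency(num_blocks: int, sms_per_wave: int) -> float:
    """Fraction of SM slots doing useful work when `num_blocks`
    thread blocks execute in waves of `sms_per_wave` blocks each.
    (Toy model: one block per SM per wave, uniform block runtimes.)"""
    waves = math.ceil(num_blocks / sms_per_wave)  # waves needed to drain the grid
    return num_blocks / (waves * sms_per_wave)

# A grid of 140 blocks on a 132-SM GPU needs 2 waves: the second wave
# runs only 8 blocks, so average utilization drops to about 53%.
print(wave_efficiency(132, 132))  # 1.0 (a perfectly full wave)
print(wave_efficiency(140, 132))  # ~0.53
```

The idle SM slots in that trailing partial wave are exactly the kind of prefill-phase slack that a co-scheduled decode batch could absorb.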

📝 Abstract
Modern LLM serving systems confront inefficient GPU utilization due to the fundamental mismatch between compute-intensive prefill and memory-bound decode phases. While current practices attempt to address this by organizing these phases into hybrid batches, such solutions create an inefficient tradeoff that sacrifices either throughput or latency, leaving substantial GPU resources underutilized. We identify two key root causes: 1) the prefill phase suffers from suboptimal compute utilization due to wave quantization and attention bottlenecks, and 2) hybrid batches disproportionately prioritize latency over throughput, resulting in wasted compute and memory bandwidth. To mitigate these issues, we present Bullet, a novel spatial-temporal orchestration system that eliminates these inefficiencies through precise phase coordination. Bullet enables concurrent execution of prefill and decode phases while dynamically provisioning GPU resources using real-time performance modeling. By integrating SLO-aware scheduling and adaptive resource allocation, Bullet maximizes utilization without compromising latency targets. Experimental evaluations on real-world workloads demonstrate that Bullet delivers 1.26x average throughput gains (up to 1.55x) over state-of-the-art systems while consistently meeting latency constraints.
Problem

Research questions and friction points this paper is trying to address.

Inefficient GPU utilization in LLM serving systems
Mismatch between compute-bound prefill and memory-bound decode phases
Forced tradeoff between throughput and latency in hybrid batching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic spatial-temporal orchestration for GPU utilization
Concurrent prefill and decode phase execution
SLO-aware scheduling with adaptive resource allocation
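A minimal sketch of what the last bullet, SLO-aware scheduling with adaptive resource allocation, can look like. It assumes (as a toy model, not the paper's actual implementation) that decode step latency scales inversely with decode's share of SMs; `split_sms` and all its parameters are hypothetical names.

```python
import math

def split_sms(base_decode_ms: float, tpot_slo_ms: float,
              total_sms: int, min_decode_sms: int = 8) -> tuple[int, int]:
    """Give decode just enough SMs to meet its per-token (TPOT) SLO,
    and hand the remaining SMs to concurrently running prefill.

    base_decode_ms: measured decode step latency when using all SMs.
    Toy latency model: latency ~= base_decode_ms / (decode's SM fraction).
    """
    needed_fraction = base_decode_ms / tpot_slo_ms     # SM share decode must keep
    decode_sms = math.ceil(needed_fraction * total_sms)
    decode_sms = max(min_decode_sms, min(total_sms, decode_sms))
    return decode_sms, total_sms - decode_sms

# Decode takes 10 ms/step with all 132 SMs, and the SLO allows 50 ms/step:
# decode only needs ~20% of the GPU, so prefill gets the other ~80%.
print(split_sms(10.0, 50.0, 132))  # (27, 105)
```

The point of decoupling latency from throughput is visible here: the latency constraint fixes only decode's minimum share, and every remaining SM is free to raise prefill throughput.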
Zejia Lin
Sun Yat-sen University, Guangzhou, China
Hongxin Xu
Sun Yat-sen University, Guangzhou, China
Guanyi Chen
Sun Yat-sen University, Guangzhou, China
Xianwei Zhang
Sun Yat-sen University; AMD Research/RTG
Architecture/System, Compilation, GPU/Memory, HPC, Simulation/Modeling
Yutong Lu
Sun Yat-sen University, Guangzhou, China