🤖 AI Summary
To address insufficient coordination between GPU computation and key-value cache (KVC) resources, and the resulting SLO violations, in large language model (LLM) inference serving, this paper proposes a multi-resource joint scheduling framework called EcoServe. Methodologically, it introduces: (i) a novel KVC pipelining mechanism that shares allocated but unused KVC space; (ii) decoupled waiting queues for prompt-processing tasks (PTs) and generation tasks (GTs); (iii) response-length-aware GT batching with proactive KVC pre-allocation; and (iv) KVC-occupancy-aware, SLO-driven priority scheduling. In trace-driven experiments with dynamic batching, EcoServe achieves, compared to vLLM, up to a 4× throughput improvement at the same latency, up to 91% lower job completion time, and an up to 91% higher SLO satisfaction ratio; versus DistServe, it uses up to 78% fewer GPUs while maintaining the same goodput.
📝 Abstract
As Large Language Models (LLMs) continue to grow, reducing costs and alleviating GPU demands have become increasingly critical. However, existing schedulers primarily target either GPU compute or Key-Value Cache (KVC) utilization, failing to fully optimize both during each iteration or to guarantee timely KVC allocations when needed. To address these challenges, we conducted a trace-based experimental analysis and made insightful observations, leading to the design of a system called EcoServe. EcoServe maximizes multi-resource utilization while ensuring service-level objective (SLO) guarantees in LLM serving. To enable adding prompts to a batch to maximize GPU utilization in each iteration, EcoServe maintains separate waiting queues for prompt processing tasks (PTs) and generation tasks (GTs). It batches GTs with the same predicted response length (RL) to save scheduling time and allocates KVC space for the predicted RL to avoid KVC allocation failures. It further has a novel KVC pipelining method that shares allocated but unused KVC space to enhance KVC utilization. In addition, it prioritizes queued requests that occupy more KVC, releasing KVC earlier and satisfying request SLOs. Experimental results demonstrate that EcoServe increases throughput by up to 4× at the same level of latency, and achieves up to 91% lower job completion time and an up to 91% higher SLO satisfaction ratio compared to vLLM. It also reduces the number of GPUs used by DistServe by up to 78% while maintaining the same level of goodput.
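The scheduling ideas described above (separate PT/GT queues, RL-aware GT batching with proactive KVC pre-allocation, and KVC-occupancy-aware priority) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation; all class and method names (`Request`, `Scheduler`, `promote_to_gt`, etc.) are hypothetical, and KVC is modeled as a flat pool of blocks.

```python
# Illustrative sketch of EcoServe-style scheduling, assuming a flat block pool
# for the KVC; names are hypothetical, not taken from the paper's code.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    predicted_rl: int   # predicted response length (tokens)
    kvc_blocks: int = 0  # KVC blocks currently held

class Scheduler:
    def __init__(self, kvc_capacity_blocks: int):
        self.pt_queue = []                    # prompt-processing tasks (PTs)
        self.gt_buckets = defaultdict(list)   # GTs grouped by predicted RL
        self.free_kvc = kvc_capacity_blocks

    def admit_pt(self, req: Request) -> None:
        self.pt_queue.append(req)

    def promote_to_gt(self, req: Request) -> bool:
        # Pre-allocate KVC for the full predicted RL so generation
        # never stalls mid-stream on a failed KVC allocation.
        if req.predicted_rl <= self.free_kvc:
            self.free_kvc -= req.predicted_rl
            req.kvc_blocks = req.predicted_rl
            self.gt_buckets[req.predicted_rl].append(req)
            return True
        return False

    def next_gt_batch(self) -> list:
        # Prefer the bucket holding the most total KVC, so cache space
        # is released sooner (KVC-occupancy-aware priority).
        if not self.gt_buckets:
            return []
        rl = max(self.gt_buckets, key=lambda r: r * len(self.gt_buckets[r]))
        return self.gt_buckets.pop(rl)
```

Batching GTs by identical predicted RL means every request in a batch finishes at the same iteration, so the batch's KVC can be freed all at once rather than fragmenting as stragglers drain.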