🤖 AI Summary
To address SLO violations and low resource utilization arising from co-scheduling interactive (strict-SLO) and batch (relaxed-SLO) requests in LLM serving, this paper proposes QLM, a unified queue management framework for multi-model, multi-SLO workloads. Its key component is a Request Waiting Time (RWT) estimator whose predictions drive a global scheduler that orchestrates LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping, alongside heterogeneous-GPU adaptation and real-time queue reordering. Evaluated on a real-world LLM serving dataset, QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving GPU utilization. Its evaluation is based on the production requirements of a cloud provider.
📝 Abstract
Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, which have tight latency SLO requirements. However, when such systems execute batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintaining SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM's evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
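The core idea in the abstract, ordering a shared queue using per-request waiting-time estimates so that tight-SLO interactive requests are not starved by relaxed-SLO batch requests, can be illustrated with a minimal sketch. All names here (`Request`, `order_queue`, the slack heuristic) are illustrative assumptions, not QLM's actual API or scheduling policy.

```python
# Hypothetical sketch of waiting-time-aware queue ordering: requests with the
# least SLO slack (SLO minus estimated waiting time) are pulled first.
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    slo_s: float       # latency SLO in seconds (tight: interactive, loose: batch)
    est_wait_s: float  # waiting time predicted by an RWT-style estimator
    model: str

def order_queue(queue: list[Request]) -> list[Request]:
    """Sort the queue so requests closest to violating their SLO run first."""
    return sorted(queue, key=lambda r: r.slo_s - r.est_wait_s)

queue = [
    Request("batch-1", slo_s=600.0, est_wait_s=120.0, model="llama-70b"),
    Request("chat-1",  slo_s=5.0,   est_wait_s=4.0,   model="llama-7b"),
    Request("chat-2",  slo_s=5.0,   est_wait_s=1.0,   model="llama-7b"),
]
ordered = order_queue(queue)
# chat-1 has the least slack (1 s) and is scheduled first; the batch request,
# with 480 s of slack, waits without violating its SLO.
```

In the full system, the scheduler would act on these slack values not only by reordering but also by invoking the LSOs the abstract names (eviction, load balancing, model swapping); this sketch shows only the ordering step.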