SLO-Aware Scheduling for Large Language Model Inferences

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM inference services overlook task-level SLO heterogeneity in multi-task coexistence scenarios, resulting in suboptimal hardware utilization and uncontrollable service quality. This paper proposes the first dynamic scheduler tailored for heterogeneous LLM inference tasks with multiple, distinct SLOs. It jointly models SLO constraints with request input/output lengths to enable fine-grained priority assignment, and introduces a simulated annealing–based online scheduling framework for real-time, SLO-aware sequence optimization. Evaluated on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, the approach achieves up to a 5× improvement in SLO compliance rate and reduces average latency by 31.6%, significantly outperforming the state-of-the-art systems vLLM and LMDeploy.

📝 Abstract
Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experiences, service level objectives (SLOs) serve as crucial benchmarks for assessing inference service capabilities. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO. To ensure satisfactory user experiences, each request's distinct SLO should be considered in scheduling. However, existing designs lack this consideration, leading to insufficient hardware utility and suboptimal performance. This paper analyzes scenarios that process tasks with varying SLOs, and introduces a simulated annealing-based scheduler that decides the request priority sequence based on each request's SLO, input length, and possible output length. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the current state-of-the-art framework vLLM and the newer framework LMDeploy.
Problem

Research questions and friction points this paper is trying to address.

SLO-aware scheduling for diverse LLM inference tasks
Optimizing hardware utility and performance in multi-SLO scenarios
Reducing latency and improving SLO attainment in LLM services
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulated annealing-based scheduler for LLM
Considers SLO, input, output lengths
Improves SLO attainment and latency
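The innovation bullets above can be sketched in code. The following is a minimal, hypothetical illustration of a simulated annealing scheduler over a request priority sequence; the `Request` fields, the serial-execution cost model in `est_service_time`, and all parameter values are assumptions for illustration, not the paper's actual formulation:

```python
import math
import random
from dataclasses import dataclass

# Hypothetical request model (illustrative fields, not the paper's exact design).
@dataclass
class Request:
    slo: float           # end-to-end latency target, seconds
    input_len: int       # prompt tokens
    est_output_len: int  # predicted decode tokens

def est_service_time(r: Request, tok_per_s: float = 50.0) -> float:
    # Toy cost model: service time grows with input + estimated output length.
    return (r.input_len + r.est_output_len) / tok_per_s

def slo_violations(order: list[Request]) -> int:
    # Count requests whose completion time (under serial execution) misses its SLO.
    t, misses = 0.0, 0
    for r in order:
        t += est_service_time(r)
        if t > r.slo:
            misses += 1
    return misses

def anneal(reqs: list[Request], steps: int = 5000,
           t0: float = 5.0, alpha: float = 0.999):
    # Simulated annealing over priority sequences: swap two requests,
    # accept worse orderings with probability exp(-delta / T), cool geometrically.
    order = reqs[:]
    best, best_cost = order[:], slo_violations(order)
    cost, temp = best_cost, t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = slo_violations(order)
        delta = new_cost - cost
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the rejected swap
        temp *= alpha
    return best, best_cost
```

The objective here counts SLO misses only; the paper additionally weighs input/output lengths into priority assignment and runs the search online per batch, which this single-shot sketch does not capture.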
Jinqi Huang
Huawei Technologies Co., Ltd
Yi Xiong
Huawei Technologies Co., Ltd
Xuebing Yu
Huawei Technologies Co., Ltd
Wenjie Huang
Shanghai Jiao Tong University
Point Cloud Compression · Video Compression · Image Compression
Entong Li
Huawei Technologies Co., Ltd
Li Zeng
Peking University
LLM Training and Inference · Vector Computing · Graph Computing
Xin Chen
Huawei Technologies Co., Ltd