🤖 AI Summary
Existing LLM inference services overlook task-level SLO heterogeneity in multi-task coexistence scenarios, resulting in suboptimal hardware utilization and uncontrollable service quality. This paper proposes the first dynamic scheduler tailored for heterogeneous LLM inference tasks with multiple, distinct SLOs. It jointly models SLO constraints with request input/output lengths to enable fine-grained priority assignment, and introduces a simulated annealing-based online scheduling framework for real-time, SLO-aware sequence optimization. Evaluated on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, the approach achieves up to a 5x improvement in SLO compliance and reduces average latency by 31.6%, significantly outperforming the state-of-the-art systems vLLM and LMDeploy.
📝 Abstract
Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experience, service level objectives (SLOs) serve as crucial benchmarks for assessing inference service capabilities. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO. To ensure a satisfactory user experience, each request's distinct SLO should be considered during scheduling. However, existing designs lack this consideration, leading to insufficient hardware utilization and suboptimal performance. This paper analyzes scenarios in which tasks with varying SLOs are processed together, and introduces a simulated annealing-based scheduler that decides the request priority sequence based on each request's SLO, input length, and expected output length. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the state-of-the-art framework vLLM and the newer framework LMDeploy.
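The core idea of an SLO-aware, simulated annealing-based ordering can be sketched as follows. This is an illustrative toy in Python, not the authors' implementation: the linear token-count cost model, the request fields, and the swap-based neighborhood are all assumptions made for the example.

```python
import math
import random

def service_time(req):
    # Toy linear cost model over input/output token counts (assumption,
    # standing in for the paper's length-aware modeling).
    return 0.01 * req["input_len"] + 0.05 * req["output_len"]

def slo_violations(order):
    # Count requests that finish after their SLO deadline when served
    # sequentially in the given order.
    t, violations = 0.0, 0
    for req in order:
        t += service_time(req)
        if t > req["slo"]:
            violations += 1
    return violations

def anneal(requests, steps=5000, t0=1.0, cooling=0.999, seed=0):
    # Simulated annealing over request orderings: propose random swaps,
    # always accept improvements, and accept worse orderings with a
    # temperature-dependent Boltzmann probability.
    rng = random.Random(seed)
    order = list(requests)
    best, cost = list(order), slo_violations(order)
    best_cost, temp = cost, t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # propose a swap
        new_cost = slo_violations(order)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(order), cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
        temp *= cooling
    return best, best_cost
```

In this sketch, requests with tight SLOs and short predicted outputs tend to move toward the front of the sequence, since serving them early removes violations at little cost to longer, slacker requests.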