🤖 AI Summary
Existing LLM inference services overlook task-level SLO heterogeneity in multi-task coexistence scenarios, resulting in suboptimal hardware utilization and uncontrollable service quality. This paper proposes the first dynamic scheduler tailored for heterogeneous LLM inference tasks with multiple, distinct SLOs. It jointly models SLO constraints with request input/output lengths to enable fine-grained priority assignment, and introduces a simulated annealing-based online scheduling framework for real-time, SLO-aware sequence optimization. Evaluated on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, the approach achieves up to a 5x improvement in SLO compliance and reduces average latency by 31.6%, significantly outperforming the state-of-the-art systems vLLM and LMDeploy.
📝 Abstract
Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experience, service level objectives (SLOs) serve as crucial benchmarks for assessing inference service capabilities. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO. To ensure a satisfactory user experience, each request's distinct SLO should be considered during scheduling. However, existing designs lack this consideration, leading to insufficient hardware utilization and suboptimal performance. This paper analyzes scenarios in which tasks with varying SLOs are processed together, and introduces a simulated annealing-based scheduler that decides the request priority sequence based on each request's SLO, input length, and expected output length. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the state-of-the-art framework vLLM and the newer framework LMDeploy.
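The core idea of an SLO-aware, simulated annealing-based ordering can be sketched as follows. This is an illustrative toy in Python, not the authors' implementation: the linear token-count cost model, the request fields, and the swap-based neighborhood are all assumptions made for the example.

```python
import math
import random

def service_time(req):
    # Toy linear cost model over input/output token counts (assumption,
    # standing in for the paper's length-aware modeling).
    return 0.01 * req["input_len"] + 0.05 * req["output_len"]

def slo_violations(order):
    # Count requests that finish after their SLO deadline when served
    # sequentially in the given order.
    t, violations = 0.0, 0
    for req in order:
        t += service_time(req)
        if t > req["slo"]:
            violations += 1
    return violations

def anneal(requests, steps=5000, t0=1.0, cooling=0.999, seed=0):
    # Simulated annealing over request orderings: propose random swaps,
    # always accept improvements, and accept worse orderings with a
    # temperature-dependent Boltzmann probability.
    rng = random.Random(seed)
    order = list(requests)
    best, cost = list(order), slo_violations(order)
    best_cost, temp = cost, t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # propose a swap
        new_cost = slo_violations(order)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(order), cost
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
        temp *= cooling
    return best, best_cost
```

In this sketch, requests with tight SLOs and short predicted outputs tend to move toward the front of the sequence, since serving them early removes violations at little cost to longer, slacker requests.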