Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

📅 2025-04-10
🤖 AI Summary
Existing LLM and AI agent inference systems lack a rigorous queuing-theoretic foundation for throughput optimization. Method: This work establishes the first queuing-theoretic framework for LLM inference services, grounded in mathematical modeling and stability theory. It introduces *work conservation* as the fundamental criterion for throughput optimality and formally proves that a broad class of work-conserving schedulers achieves throughput optimality under both LLM and AI agent workloads. Contribution/Results: The analysis reveals critical stability differences among mainstream systems: Orca and Sarathi-serve attain theoretical throughput optimality, whereas FastTransformer and vanilla vLLM exhibit inherent stability limitations. These theoretical findings are validated through rigorous proofs and empirical evaluation across four production-grade systems—Orca, Sarathi-serve, FastTransformer, and vLLM—providing a verifiable mathematical foundation and principled scheduling guidelines for LLM inference system design.

📝 Abstract
As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant effort has targeted system-level engineering, little has been explored from a mathematical modeling and queuing perspective. In this paper, we aim to develop the queuing fundamentals for LLM inference, bridging the gap between the queuing and LLM system communities. In particular, we study the throughput aspect of LLM inference systems. We prove that a large class of 'work-conserving' scheduling algorithms can achieve maximum throughput for both individual requests and AI agent workloads, highlighting 'work-conserving' as a key design principle in practice. Evaluations of real-world systems show that Orca and Sarathi-serve are throughput-optimal, reassuring practitioners, while FastTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our results highlight the substantial benefits the queuing community can offer in improving LLM inference systems and call for more interdisciplinary development.
Problem

Research questions and friction points this paper is trying to address.

How can throughput be optimized in LLM inference systems?
What queuing-theoretic fundamentals apply to LLM inference?
Which scheduling algorithms are provably throughput-optimal?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes queuing fundamentals for LLM inference
Proves that work-conserving scheduling algorithms achieve maximum throughput
Evaluates real-world systems (Orca, Sarathi-serve, FastTransformer, vLLM) for throughput optimality
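To make the work-conservation principle concrete, here is a minimal toy sketch (not the paper's implementation, and `WorkConservingScheduler`, `batch_capacity`, and the token-per-step decode model are illustrative assumptions): a continuous-batching scheduler is work-conserving if it never leaves batch capacity idle while requests are waiting.

```python
from collections import deque

class WorkConservingScheduler:
    """Toy continuous-batching scheduler illustrating the
    work-conserving property: no batch slot idles while work waits."""

    def __init__(self, batch_capacity):
        self.batch_capacity = batch_capacity
        self.queue = deque()  # waiting requests: (req_id, tokens_remaining)
        self.batch = {}       # running requests: req_id -> tokens_remaining

    def submit(self, req_id, tokens):
        """Enqueue a request that needs `tokens` decode steps."""
        self.queue.append((req_id, tokens))

    def step(self):
        """One decode iteration; returns the requests that finished."""
        # Work conservation: refill spare batch capacity from the queue
        # before decoding, so no slot sits idle while requests wait.
        while len(self.batch) < self.batch_capacity and self.queue:
            req_id, tokens = self.queue.popleft()
            self.batch[req_id] = tokens
        # Decode one token for every request currently in the batch.
        finished = []
        for req_id in list(self.batch):
            self.batch[req_id] -= 1
            if self.batch[req_id] == 0:
                finished.append(req_id)
                del self.batch[req_id]
        return finished
```

A non-work-conserving variant (e.g. one that waits for the whole batch to drain before admitting new requests, as in static batching) would leave slots idle and, per the paper's analysis, can lose throughput optimality.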