A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length

📅 2024-07-07

🏛️ International Symposium on Modeling and Optimization in Mobile, Ad-Hoc and Wireless Networks

📈 Citations: 12

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Variable-length outputs in LLM interactive serving induce significant inference queuing latency due to output-token–dependent service times. Method: We propose a unified theoretical framework integrating M/G/1 and batch-service queueing models, the first to treat output token count as a stochastic service time. We jointly optimize the max-token limit and batch scheduling policies—fixed, dynamic, and elastic—to characterize their distinct latency behaviors under output-length uncertainty. Contribution/Results: Our analysis reveals the dominant impact of long-tail requests on mean queuing delay. Event-driven simulations validate model accuracy (<5% error): setting max-token = 256 reduces mean queuing delay by 38%; under load fluctuations, elastic batching cuts delay by 22% versus fixed batching. The core contribution is establishing a quantitative relationship between output-length variability and system latency, enabling principled co-optimization of inference parameters.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) propel the prosperity of interactive AI applications showcased by ChatGPT that demand timely response of inference services. However, LLM inference is computation intensive and memory intensive, and improper parameter configuration at LLM platforms may exacerbate the inference time. In this paper, we analyze the impact of LLM output token distribution on the inference queueing delay, where the max-token clipping and the batched inference are considered. By formulating an M/G/1 model, we observe that enforcing a maximum output token limit on a very small fraction of inference requests can significantly reduce the queueing delay, and our model facilitates the selection of the optimal limit. For the batch inference, we model the service process as a bulk queue in which the batch processing time is affected by the batch size and the maximum token size inside this batch jointly. The queueing delays of the batching of all buffered requests (dynamic batching), the batching of constant number of requests (fixed batching), and the batching without intra-batch waiting (elastic batching) are derived. Experimental results show that our mathematical models coincide with the event-driven simulations well.

Problem

Research questions and friction points this paper is trying to address.

Analyzes LLM inference queueing delay from token distribution impact

Models optimal max-token clipping to significantly reduce queueing delays

Derives queueing delays for dynamic, fixed, and elastic batching strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Queueing theory models LLM inference with variable token lengths

Maximum token clipping reduces queueing delay for few requests

Bulk queue modeling optimizes batch size and token limits

🔎 Similar Papers

No similar papers found.