Lodestar: An Online-Learning LLM Inference Router

📅 2026-05-30

🤖 AI Summary
This work addresses the challenge of serving large language model (LLM) inference requests in distributed GPU clusters, where input-dependent execution, strong cross-request coupling, and nonlinear latency behavior make traditional load balancing ineffective. It brings online learning to LLM inference routing: the system continuously collects real-time cluster state, request characteristics, and performance feedback, and uses them to refine a reward predictor that guides each routing decision. The system integrates with existing serving stacks such as vLLM. Compared with a state-of-the-art prefix-cache- and load-aware heuristic in a public cloud GPU cluster, it lowers average time-to-first-token (TTFT) by 1.41× and P99 TTFT by 1.47× on average, with gains of up to 2.15×/1.86× on homogeneous clusters and 4.38×/4.42× on heterogeneous clusters, and it learns effective routing strategies within about five minutes.
📝 Abstract
Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency, such as time-to-first-token (TTFT), and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load-balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at a per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route each inference request to the instance that will maximize a given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix-cache- and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.
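The collect, predict, route, and update loop described in the abstract can be illustrated with a small sketch. The abstract does not specify Lodestar's features, model, or exploration strategy, so everything below is an assumption made for illustration: the `Instance` fields, the two-feature vector, the linear SGD predictor, the epsilon-greedy exploration, and the made-up `toy_ttft` function that stands in for a real cluster.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    queue_depth: float      # hypothetical: normalized pending work, 0.0 idle to 1.0 saturated
    cached_fraction: float  # hypothetical: share of the prompt expected to hit the prefix cache


def features(inst, prompt_kt):
    """Per-(instance, request) features; prompt_kt is prompt length in kilotokens."""
    return [inst.queue_depth, prompt_kt * (1.0 - inst.cached_fraction)]


def toy_ttft(inst, prompt_kt):
    """Made-up latency model standing in for TTFT measured on a real cluster."""
    queue, uncached = features(inst, prompt_kt)
    return 0.05 + 0.8 * queue + 0.3 * uncached


class OnlineRewardPredictor:
    """Linear reward model updated by one SGD step per observed request."""

    def __init__(self, n_features, lr=0.1):
        self.lr = lr
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, x, reward):
        err = self.predict(x) - reward  # gradient of squared error (up to a factor of 2)
        for i, xi in enumerate(x):
            self.weights[i] -= self.lr * err * xi
        self.bias -= self.lr * err


def route(predictor, instances, prompt_kt, epsilon=0.0, rng=random):
    """Return the index of the instance with the highest predicted reward.

    With probability epsilon a random instance is chosen instead, so the
    predictor keeps receiving feedback from instances it currently dislikes.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(instances))
    scores = [predictor.predict(features(inst, prompt_kt)) for inst in instances]
    return max(range(len(instances)), key=scores.__getitem__)


def simulate(steps=20000, seed=0, epsilon=0.2):
    """Run the online loop against the toy latency model and return the predictor."""
    rng = random.Random(seed)
    predictor = OnlineRewardPredictor(n_features=2)
    for _ in range(steps):
        # Snapshot of the (simulated) cluster and the incoming request.
        instances = [Instance(rng.random(), rng.random()) for _ in range(3)]
        prompt_kt = rng.uniform(0.0, 2.0)
        chosen = route(predictor, instances, prompt_kt, epsilon, rng)
        # Observe feedback for the chosen instance only; reward is negative TTFT.
        reward = -toy_ttft(instances[chosen], prompt_kt)
        predictor.update(features(instances[chosen], prompt_kt), reward)
    return predictor
```

Calling `simulate()` and then `route(predictor, instances, prompt_kt)` with `epsilon=0.0` gives a purely greedy decision from the learned model. A real router would face noisy, nonlinear latency and would likely need richer features and a more expressive predictor than this linear stand-in.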
Problem

Research questions and friction points this paper is trying to address.

LLM inference
request routing
load balancing
heterogeneous accelerators
latency optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

online learning
LLM inference routing
reward prediction
heterogeneous GPU clusters
TTFT optimization