Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies joint routing (query-LLM matching) and scheduling (query prioritization) for conversational LLM services, where unsatisfied users may retry queries and explicit rating requests degrade the user experience. It introduces contextual queueing bandits with multinomial logit feedback (CQB-MNL), a framework that models query retrials together with context-based learning of user preferences over LLMs from implicit retrial feedback. The proposed anytime CQB (ACQB) algorithm combines Thompson sampling with forced exploration at a decaying rate, achieving $\widetilde{\mathcal{O}}(\sqrt{t})$ cumulative routing regret and $\widetilde{\mathcal{O}}(t^{-1/4})$ queue length regret while maintaining queue stability. Experiments on the SPROUT, EmbedLLM, and RouterBench datasets, using contrastive-learning-refined query embeddings and a disjoint per-LLM parameter model, show consistent improvements over baselines.

📝 Abstract
Explosive demands for LLMs often cause user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook the following two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for "explicit" feedback, such as ratings, degrade user experiences. In this paper, we develop a joint routing and scheduling algorithm that leverages "implicit" feedback inferred from user retrial behaviors. The key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL). CQB-MNL models query retrials, as well as context-based learning for user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for any large $t$. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on SPROUT, EmbedLLM, and RouterBench datasets confirm that both algorithms consistently outperform baselines.
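The abstract's core mechanism, Thompson sampling combined with forced exploration at a decaying rate over per-LLM (disjoint) parameters, can be sketched as follows. This is an illustrative simplification, not the paper's ACQB: the class name, the binary retrial reward, the Gaussian posterior (the paper uses multinomial logit feedback), and the $t^{-1/2}$ exploration schedule are all assumptions made for the sketch.

```python
import numpy as np

class ACQBRouterSketch:
    """Illustrative sketch (NOT the paper's exact ACQB): routes each query
    embedding to one of K LLMs using disjoint Bayesian linear models per LLM,
    Thompson sampling, and forced uniform exploration whose probability
    decays with time (schedule chosen here as an assumption)."""

    def __init__(self, n_llms, dim, noise_var=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.t = 0
        self.noise_var = noise_var
        # Per-LLM posterior statistics: precision matrix A_k and vector b_k,
        # so the posterior mean is theta_hat_k = A_k^{-1} b_k.
        self.A = [np.eye(dim) for _ in range(n_llms)]
        self.b = [np.zeros(dim) for _ in range(n_llms)]

    def route(self, x):
        """Pick an LLM index for query embedding x."""
        self.t += 1
        eps = min(1.0, self.t ** -0.5)  # decaying forced-exploration rate
        if self.rng.random() < eps:
            return int(self.rng.integers(len(self.A)))  # forced exploration
        # Thompson sampling: draw theta_k ~ N(theta_hat_k, noise_var * A_k^{-1})
        # for each LLM and route to the highest sampled score.
        scores = []
        for A_k, b_k in zip(self.A, self.b):
            cov = self.noise_var * np.linalg.inv(A_k)
            theta = self.rng.multivariate_normal(np.linalg.solve(A_k, b_k), cov)
            scores.append(x @ theta)
        return int(np.argmax(scores))

    def update(self, k, x, reward):
        """Implicit feedback, e.g. reward = 0 if the user retried the query
        (unsatisfied), 1 otherwise -- a binary stand-in for MNL feedback."""
        self.A[k] += np.outer(x, x)
        self.b[k] += reward * x
```

In this simplified form the forced-exploration term guarantees every LLM keeps being sampled (keeping the per-LLM posteriors well conditioned), while the decaying rate lets Thompson sampling dominate as estimates sharpen; the paper's analysis couples this learning behavior with queue-stability guarantees that the sketch does not model.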
Problem

Research questions and friction points this paper is trying to address.

LLM routing
query scheduling
user retrials
queueing systems
implicit feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual queueing bandits
implicit feedback
user retrials
LLM routing and scheduling
Thompson sampling
Seoungbin Bae
Department of Industrial and Systems Engineering, KAIST
Junyoung Son
Graduate School of Data Science, KAIST
Dabeen Lee
Department of Mathematical Sciences, Seoul National University
Optimization · Mathematical Programming · Algorithms · Machine Learning