Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies joint routing (query-LLM matching) and scheduling (query prioritization) for conversational LLM services, where unsatisfied users may retry queries and thereby add to the server backlog. Rather than requesting explicit ratings, the approach learns from implicit feedback inferred from user retrial behavior, formalized as contextual queueing bandits with multinomial logit feedback (CQB-MNL). The proposed algorithm, anytime CQB (ACQB), combines Thompson sampling with forced exploration at a decaying rate, achieving a cumulative routing regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ together with a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for large $t$. Experiments on the SPROUT, EmbedLLM, and RouterBench datasets report consistent improvements over baselines.

📝 Abstract

Explosive demand for LLMs often causes user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for "explicit" feedback, such as ratings, degrade the user experience. In this paper, we develop a joint routing and scheduling algorithm that leverages "implicit" feedback inferred from user retrial behaviors. The key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL), which models query retrials as well as context-based learning of user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for any large $t$. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on SPROUT, EmbedLLM, and RouterBench datasets confirm that both algorithms consistently outperform baselines.
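The two ingredients named in the abstract, multinomial logit feedback and Thompson sampling with decaying forced exploration, can be illustrated with a minimal sketch. This is not the ACQB algorithm itself. The function names, the exploration exponent of 0.5, the independent Gaussian posteriors, and the reading of the MNL outside option as a retry are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def mnl_probabilities(utilities):
    """Generic MNL choice probabilities with an outside option of utility 0.

    Returns (p_outside, [p_1, ..., p_K]). Treating the outside option as
    "user retries" is an assumption, not something the abstract states.
    """
    m = max(0.0, max(utilities))  # shift by the max for numerical stability
    exp_out = math.exp(-m)
    exps = [math.exp(u - m) for u in utilities]
    z = exp_out + sum(exps)
    return exp_out / z, [e / z for e in exps]


def exploration_rate(t, exponent=0.5):
    """Forced-exploration probability at round t >= 1 (hypothetical schedule)."""
    return min(1.0, t ** -exponent)


def thompson_scores(means, stds, rng):
    """Draw one sample per LLM from independent Gaussian posteriors (assumed)."""
    return [rng.gauss(mu, sd) for mu, sd in zip(means, stds)]


def route(t, sampled_scores, rng):
    """Pick an LLM index for the query served at round t.

    With probability exploration_rate(t) choose uniformly at random;
    otherwise choose the LLM with the highest sampled score.
    """
    if rng.random() < exploration_rate(t):
        return rng.randrange(len(sampled_scores))
    return max(range(len(sampled_scores)), key=lambda k: sampled_scores[k])


if __name__ == "__main__":
    rng = random.Random(0)
    scores = thompson_scores([0.2, 0.5, 0.1], [1.0, 1.0, 1.0], rng)
    print(route(100, scores, rng), mnl_probabilities(scores))
```

Because the exploration probability shrinks as `t` grows, early rounds route almost at random while later rounds rely mostly on the posterior samples. This matches the qualitative behavior the abstract describes, though not necessarily the paper's exact schedule.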

Problem

Research questions and friction points this paper is trying to address.

LLM routing

query scheduling

user retrials

queueing systems

implicit feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual queueing bandits

implicit feedback

user retrials

LLM routing and scheduling

Thompson sampling

🔎 Similar Papers

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Load Balancing

2024-08-24, Citations: 0

Efficient Sequential Decision Making with Large Language Models

2024-06-17, Conference on Empirical Methods in Natural Language Processing, Citations: 3
