Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the slow decay of queue length regret in heterogeneous job scheduling under unknown context-dependent service rates. The authors propose a three-phase algorithm, CQB-$\eta$-2, which begins with pure random exploration to obtain an initial estimator, transitions to a hybrid phase that combines $\eta$-random exploration with a UCB rule to keep learning while maintaining negative drift, and finally switches to a pure UCB policy. By restricting random exploration to the rounds before a carefully chosen cutoff, the method improves the upper bound on queue length regret from $\widetilde{\mathcal{O}}(T^{-1/4})$ to $\widetilde{\mathcal{O}}(T^{-1/2})$. The paper also establishes an $\Omega(T^{-1/2})$ minimax lower bound, so the upper and lower bounds together characterize the minimax dependence on the time horizon $T$ up to logarithmic factors.
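For reference, the claims in this summary can be written compactly. The notation below is assumed here for illustration and is not taken from the paper: $Q_T^{\pi}$ and $Q_T^{\star}$ denote the queue lengths at horizon $T$ under a learner $\pi$ and under the oracle, and $R_T(\pi)$ denotes the resulting queue length regret.

$$
\begin{aligned}
R_T(\pi) &= \mathbb{E}\left[ Q_T^{\pi} - Q_T^{\star} \right], \\
R_T(\text{CQB-}\eta\text{-2}) &= \widetilde{\mathcal{O}}\left(T^{-1/2}\right), \\
\inf_{\pi} \sup_{\text{instances}} R_T(\pi) &= \Omega\left(T^{-1/2}\right).
\end{aligned}
$$

The first line is the definition of queue length regret, the second is the stated upper bound, and the third is the stated minimax lower bound, so the two rates match up to logarithmic factors.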

📝 Abstract

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
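The three-phase structure described in the abstract can be sketched as a simple round-dependent decision rule. This is a minimal illustration only, not the authors' implementation: the names `phase`, `choose`, `tau0` (end of pure exploration), `tau` (exploration cutoff), and `eta` are hypothetical, the abstract does not say how the cutoff or the exploration rate is set, and the UCB scores are taken as given rather than computed from an estimator.

```python
import random


def phase(t, tau0, tau):
    """Return the phase that round t (1-indexed) falls in.

    tau0 and tau are assumed phase boundaries with tau0 <= tau;
    how the paper chooses them is not specified in the abstract.
    """
    if t <= tau0:
        return "explore"   # (i) pure random exploration
    if t <= tau:
        return "hybrid"    # (ii) eta-random exploration combined with UCB
    return "ucb"           # (iii) pure UCB after the cutoff


def choose(t, tau0, tau, eta, ucb_scores, rng):
    """Pick an index into ucb_scores, one score per candidate decision.

    In the hybrid phase the rule explores uniformly at random with
    probability eta and otherwise follows the largest UCB score.
    """
    n = len(ucb_scores)
    current = phase(t, tau0, tau)
    if current == "explore":
        return rng.randrange(n)
    if current == "hybrid" and rng.random() < eta:
        return rng.randrange(n)
    # Greedy with respect to the (assumed precomputed) UCB scores.
    return max(range(n), key=lambda i: ucb_scores[i])
```

For example, with `rng = random.Random(0)`, calling `choose(t, 50, 300, 0.1, scores, rng)` explores uniformly for rounds 1 to 50, mixes exploration and UCB for rounds 51 to 300, and is purely UCB-driven from round 301 onward. The point of the design, as the abstract describes it, is that no random exploration happens after the cutoff.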

Problem

Research questions and friction points this paper is trying to address.

Contextual Queueing Bandits

Queue Length Regret

Service Rate

Stochastic Contexts

Minimax Lower Bound

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual Queueing Bandits

Queue Length Regret

Rate-Optimal Algorithm