🤖 AI Summary
This work addresses the problem of context-aware scheduling in queueing systems with unknown service rates, aiming to minimize the gap between the resulting queue length and that of an optimal policy. The authors formulate this as a contextual queueing multi-armed bandit problem, where tasks are dynamically matched to servers by learning a logistic model that maps task features and unknown server-specific parameters to service rates. Performance is measured via queue-length regret, the difference in queue length between the learned policy and the optimal one. The paper introduces a novel policy-switching queueing framework and a coupling-based analysis that, for the first time, decomposes queue-length regret into short-term decision errors and long-term state deviations. Two algorithms, CQB-ε and CQB-Opt, are proposed, achieving Õ(T⁻¹/⁴) regret under stochastic contexts and O(log²T) regret under adversarial contexts. Both theoretical analysis and experiments validate their effectiveness.
📝 Abstract
We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy at a given time step, since the two policies may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.
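To make the model concrete, the following is a minimal sketch of the logistic service-rate model and an ε-greedy job-server matching rule in the spirit of CQB-$\varepsilon$. All function names, shapes, and the exploration scheme are illustrative assumptions; the paper's actual estimator and algorithmic details are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def service_rate(x, theta):
    """Logistic service/departure rate for a job with context x at a
    server with (unknown) parameter theta, as in the paper's model:
    mu(x, theta) = sigmoid(theta . x)."""
    return sigmoid(theta @ x)

def epsilon_greedy_match(contexts, theta_hat, eps, rng):
    """Hypothetical epsilon-greedy matching (CQB-eps style sketch):
    with probability eps explore a uniformly random pair; otherwise
    pick the (job, server) pair maximizing the estimated service rate.

    contexts:  (n_jobs, d) array of job feature vectors in the queue
    theta_hat: (n_servers, d) array of current parameter estimates
    """
    n_jobs, n_servers = contexts.shape[0], theta_hat.shape[0]
    if rng.random() < eps:
        return int(rng.integers(n_jobs)), int(rng.integers(n_servers))
    # Estimated rates for every job-server pair, shape (n_jobs, n_servers).
    rates = sigmoid(contexts @ theta_hat.T)
    j, k = np.unravel_index(np.argmax(rates), rates.shape)
    return int(j), int(k)
```

With eps = 0 the rule is purely greedy with respect to the current estimates; in the bandit setting a decaying exploration rate trades off learning the server parameters against serving the currently best-looking pair.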