Contextual Linear Bandits with Delay as Payoff

πŸ“… 2025-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper studies the contextual linear multi-armed bandit problem where delay is proportional to reward (or loss), relaxing the conventional assumption that delay is independent of feedback. It accommodates high-dimensional action spaces and dynamic contexts. We propose a phase-based arm elimination algorithm grounded in volumetric spanners, the first to achieve efficient regret control under delay-dependent feedback. The theoretical regret upper bound is $O(DDelta_{max}log T)$, further refined in the loss setting. We uncover a fundamental structural separation between regret behavior in reward and loss regimesβ€”a previously unobserved phenomenon. Empirical evaluations demonstrate substantial improvements over existing baselines across diverse benchmarks.

Technology Category

Application Category

πŸ“ Abstract
A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $DDelta_{max}log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $Delta_{max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
Problem

Research questions and friction points this paper is trying to address.

Contextual linear bandits with delay
Efficient algorithm for payoff-dependent delays
Phased arm elimination in volumetric spanner
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phased arm elimination in volumetric spanner
Handles payoff-dependent delays efficiently
Extends to varying action sets effectively
πŸ”Ž Similar Papers
No similar papers found.