Contextual Linear Bandits with Delay as Payoff

📅 2025-02-18

📈 Citations: 0

✨ Influential: 0

career value

229K/year

🤖 AI Summary

This paper studies the contextual linear multi-armed bandit problem where delay is proportional to reward (or loss), relaxing the conventional assumption that delay is independent of feedback. It accommodates high-dimensional action spaces and dynamic contexts. We propose a phase-based arm elimination algorithm grounded in volumetric spanners, the first to achieve efficient regret control under delay-dependent feedback. The theoretical regret upper bound is $O(DDelta_{max}log T)$, further refined in the loss setting. We uncover a fundamental structural separation between regret behavior in reward and loss regimes—a previously unobserved phenomenon. Empirical evaluations demonstrate substantial improvements over existing baselines across diverse benchmarks.

Technology Category

Application Category

📝 Abstract

A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $DDelta_{max}log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $Delta_{max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.

Problem

Research questions and friction points this paper is trying to address.

Contextual linear bandits with delay

Efficient algorithm for payoff-dependent delays

Phased arm elimination in volumetric spanner

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phased arm elimination in volumetric spanner

Handles payoff-dependent delays efficiently

Extends to varying action sets effectively

🔎 Similar Papers

Neural Dueling Bandits