Neural Contextual Bandits Under Delayed Feedback Constraints

📅 2025-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the neural contextual bandit problem under delayed feedback, where rewards for actions are observed only after an unknown, random delay. This is a realistic challenge in applications such as online recommendation and clinical trials. We propose the first adaptive online learning framework that explicitly incorporates delay modeling into both Neural Upper Confidence Bound (Neural-UCB) and Neural Thompson Sampling (Neural-TS). Theoretically, we establish an $O(\sqrt{T})$ cumulative regret bound under sub-exponential delay distributions, relaxing the standard assumption of immediate feedback. Our method leverages deep neural networks to extract high-dimensional context features and introduces delay-aware confidence intervals and posterior sampling mechanisms. Extensive experiments on benchmark datasets, including MNIST and Mushroom, demonstrate significant improvements over existing baselines. The approach is robust to stochastic delays ranging from several to hundreds of time steps, offering both theoretical rigor and practical efficacy.

📝 Abstract

This paper presents a new algorithm for neural contextual bandits (CBs) that addresses the challenge of delayed reward feedback, where the reward for a chosen action is revealed after a random, unknown delay. This scenario is common in applications such as online recommendation systems and clinical trials, where reward feedback is delayed because the outcomes or results of a user's actions (such as recommendations or treatment responses) take time to manifest and be measured. The proposed algorithm, called Delayed NeuralUCB, uses an upper confidence bound (UCB)-based exploration strategy. Under the assumption of independent and identically distributed sub-exponential reward delays, we derive an upper bound on the cumulative regret over a T-length horizon. We further consider a variant of the algorithm, called Delayed NeuralTS, that uses Thompson Sampling-based exploration. Numerical experiments on real-world datasets, such as MNIST and Mushroom, along with comparisons to benchmark approaches, demonstrate that the proposed algorithms effectively manage varying delays and are well-suited for complex real-world scenarios.
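The delayed-feedback loop described in the abstract can be illustrated with a small sketch. This is not the paper's Delayed NeuralUCB: a per-arm linear model (LinUCB-style) is swapped in for the neural network so the example stays dependency-free, and the delays are drawn from an exponential distribution as one possible sub-exponential choice. The names `DelayedLinUCB`, `simulate`, `alpha`, `lam`, and `mean_delay` are hypothetical and chosen only for illustration.

```python
import heapq
import math
import random


class DelayedLinUCB:
    """Linear UCB with a pending-reward buffer (linear stand-in for the neural model)."""

    def __init__(self, n_arms, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha  # exploration weight (hypothetical default)
        # Per-arm inverse design matrix, initialised to I / lam.
        self.A_inv = [[[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                       for i in range(dim)] for _ in range(n_arms)]
        # Per-arm sum of reward-weighted contexts.
        self.b = [[0.0] * dim for _ in range(n_arms)]
        # Min-heap of (arrival_time, seq, arm, context, reward); seq breaks ties.
        self.pending = []
        self._seq = 0

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def ucb(self, arm, x):
        Ainv_x = self._matvec(self.A_inv[arm], x)
        theta = self._matvec(self.A_inv[arm], self.b[arm])
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(a * xi for a, xi in zip(Ainv_x, x)), 0.0))
        return mean + self.alpha * width

    def select(self, x):
        scores = [self.ucb(a, x) for a in range(len(self.b))]
        return max(range(len(scores)), key=scores.__getitem__)

    def record(self, t, arm, x, reward, delay):
        """Store a reward that only becomes visible at time t + delay."""
        heapq.heappush(self.pending, (t + delay, self._seq, arm, x, reward))
        self._seq += 1

    def absorb(self, t):
        """Apply every reward whose arrival time is <= t; return how many arrived."""
        arrived = 0
        while self.pending and self.pending[0][0] <= t:
            _, _, arm, x, r = heapq.heappop(self.pending)
            Ainv_x = self._matvec(self.A_inv[arm], x)
            denom = 1.0 + sum(a * xi for a, xi in zip(Ainv_x, x))
            # Sherman-Morrison rank-one update of (A + x x^T)^{-1}.
            d = len(x)
            self.A_inv[arm] = [[self.A_inv[arm][i][j] - Ainv_x[i] * Ainv_x[j] / denom
                                for j in range(d)] for i in range(d)]
            self.b[arm] = [bi + r * xi for bi, xi in zip(self.b[arm], x)]
            arrived += 1
        return arrived


def simulate(horizon=2000, n_arms=3, dim=3, mean_delay=20.0, seed=0):
    """Run the loop on synthetic linear rewards; return (cumulative regret, rewards still pending)."""
    rng = random.Random(seed)
    true_theta = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_arms)]
    agent = DelayedLinUCB(n_arms, dim)
    regret = 0.0
    for t in range(horizon):
        agent.absorb(t)  # only rewards that have arrived by now update the model
        x = [rng.uniform(0, 1) for _ in range(dim)]
        means = [sum(th * xi for th, xi in zip(theta, x)) for theta in true_theta]
        arm = agent.select(x)
        reward = means[arm] + rng.gauss(0, 0.1)
        # Assumed delay model: exponential, rounded to whole steps, at least 1.
        delay = int(rng.expovariate(1.0 / mean_delay)) + 1
        agent.record(t, arm, x, reward, delay)
        regret += max(means) - means[arm]
    return regret, len(agent.pending)


if __name__ == "__main__":
    print(simulate())
```

The point of the sketch is the ordering inside the loop: the learner chooses an arm using only the rewards that have already arrived, and rewards still in the buffer leave the confidence width unchanged until they are absorbed.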

Problem

Research questions and friction points this paper is trying to address.

Addresses delayed reward feedback in neural contextual bandits

Proposes algorithms for handling random unknown reward delays

Evaluates performance on real-world datasets with varying delays

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural contextual bandits with delayed feedback

UCB-based exploration for delayed rewards

Thompson Sampling variant for complex scenarios
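The two exploration strategies listed above differ mainly in how a per-arm estimate and its uncertainty become a score. A minimal sketch, assuming each arm already has a predicted mean reward and an uncertainty width computed from the feedback received so far; the function names and the `alpha` and `nu` scaling parameters are hypothetical, and this is not the paper's exact scoring rule:

```python
import random


def select_ucb(means, widths, alpha=1.0):
    """Optimistic choice: pick the arm with the largest mean + alpha * width."""
    scores = [m + alpha * w for m, w in zip(means, widths)]
    return max(range(len(scores)), key=scores.__getitem__)


def select_ts(means, widths, nu=1.0, rng=None):
    """Randomised choice: draw one Gaussian score per arm and pick the largest."""
    rng = rng or random.Random()
    scores = [rng.gauss(m, nu * w) for m, w in zip(means, widths)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    means, widths = [0.5, 0.4], [0.0, 0.3]
    print(select_ucb(means, widths))  # arm 1: 0.4 + 0.3 exceeds 0.5 + 0.0
    print(select_ts(means, widths, rng=random.Random(0)))
```

Under delayed feedback, the widths shrink only when rewards actually arrive, so both rules keep exploring an arm whose outcomes are still pending.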

🔎 Similar Papers

Neural Dueling Bandits