🤖 AI Summary
This work studies the batched multi-armed bandit problem with covariates and cross-arm reward correlations—arising naturally in personalized medicine and recommender systems. To overcome the curse of dimensionality that plagues existing nonparametric approaches, we propose the first semiparametric batched framework: it models inter-arm reward dependence via single-index regression (SIR) and integrates dynamic binning with round-wise arm elimination in the BIDS algorithm. We establish minimax-optimal regret bounds under both settings—with and without prior knowledge of the index direction. Extensive simulations and real-data experiments demonstrate that our method significantly outperforms state-of-the-art nonparametric batched bandit algorithms, achieving both statistical efficiency and computational feasibility.
📝 Abstract
The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. Moreover, in many practical applications, such as personalized medicine and recommendation systems, feedback is provided in batches, contextual information is available at the time of decision-making, and rewards from different arms are related rather than independent. We propose a novel semi-parametric framework for batched bandits with covariates and a shared parameter across arms, leveraging the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by cite{jiang2024batched}.