Semi-Parametric Batched Global Multi-Armed Bandits with Covariates

📅 2025-03-01

📈 Citations: 0

✨ Influential: 0

career value

216K/year

🤖 AI Summary

This work studies the batched multi-armed bandit problem with covariates and cross-arm reward correlations—arising naturally in personalized medicine and recommender systems. To overcome the curse of dimensionality that plagues existing nonparametric approaches, we propose the first semiparametric batched framework: it models inter-arm reward dependence via single-index regression (SIR) and integrates dynamic binning with round-wise arm elimination in the BIDS algorithm. We establish minimax-optimal regret bounds under both settings—with and without prior knowledge of the index direction. Extensive simulations and real-data experiments demonstrate that our method significantly outperforms state-of-the-art nonparametric batched bandit algorithms, achieving both statistical efficiency and computational feasibility.

Technology Category

Application Category

📝 Abstract

The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. Moreover, in many practical applications, such as personalized medicine and recommendation systems, feedback is provided in batches, contextual information is available at the time of decision-making, and rewards from different arms are related rather than independent. We propose a novel semi-parametric framework for batched bandits with covariates and a shared parameter across arms, leveraging the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by cite{jiang2024batched}.

Problem

Research questions and friction points this paper is trying to address.

Develops a semi-parametric framework for batched multi-armed bandits with covariates.

Addresses sequential decision-making with batch feedback and contextual information.

Proposes an algorithm to optimize long-term rewards using dynamic binning and arm elimination.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-parametric framework for batched bandits

Single-index regression model for reward relationships

Dynamic binning and successive arm elimination

🔎 Similar Papers

Batched Nonparametric Contextual Bandits