Kernel Single-Index Bandits: Estimation, Inference, and Learning

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses contextual bandits with a single-index reward model, where each arm's reward depends on its context through an arm-specific index parameter and an unknown nonparametric link function. The authors propose a kernelized ε-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression to learn the reward functions. Under adaptive sampling, the method overcomes challenges arising from temporal dependence and variance inflation, establishing, for the first time, asymptotic normality of the single-index estimator and a central limit theorem for directional functionals in the reproducing kernel Hilbert space (RKHS). This unifies the frameworks of online learning and statistical inference. The theoretical analysis yields an optimal finite-time regret bound of Õ(√T) and constructs asymptotically valid confidence intervals.
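The Stein-based index estimation mentioned above can be illustrated under a Gaussian-design assumption: by the first-order Stein identity, if X ~ N(0, I) then E[Y·X] is proportional to the index direction θ, so a normalized sample moment recovers it. A minimal sketch (not the paper's estimator; all choices of link, dimension, and sample size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a single-index model y = g(x^T theta) + noise with Gaussian design.
d, n = 5, 20000
theta = np.array([3.0, 1.0, 0.0, -2.0, 0.5])
theta /= np.linalg.norm(theta)
X = rng.standard_normal((n, d))
y = np.tanh(X @ theta) + 0.1 * rng.standard_normal(n)

# First-order Stein identity: for X ~ N(0, I), E[y X] = E[g'(X^T theta)] * theta,
# so the sample moment (1/n) * sum_i y_i x_i points along theta.
theta_hat = (X.T @ y) / n
theta_hat /= np.linalg.norm(theta_hat)

# Alignment with the true direction; close to 1 when recovery succeeds.
print(abs(theta_hat @ theta))
```

The paper's contribution is to carry out this kind of estimation under adaptive sampling, where the covariate distribution depends on the bandit policy rather than being i.i.d. Gaussian as in this toy version.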

📝 Abstract
We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.
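The interaction loop described in the abstract can be sketched as follows: at each round the learner explores with probability ε or plays the greedy arm, records the realized propensity, and refits per-arm reward estimates by inverse-propensity-weighted kernel ridge regression on the projected index. This is a simplified sketch, not the paper's algorithm: the index directions are taken as known (the paper estimates them online), and the kernel, bandwidth, and regularization values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def ipw_krr_predict(s_train, w, y, s_query, lam=0.1, h=0.5):
    """Inverse-propensity-weighted kernel ridge regression (Gaussian kernel).
    Minimizes sum_i w_i (y_i - f(s_i))^2 + lam * ||f||_RKHS^2, where
    s_i = x_i^T theta is the scalar index and w_i = 1 / pi_i."""
    K = np.exp(-(s_train[:, None] - s_train[None, :]) ** 2 / (2 * h ** 2))
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(s_train)), W @ y)
    Kq = np.exp(-(s_query[:, None] - s_train[None, :]) ** 2 / (2 * h ** 2))
    return Kq @ alpha

# Two arms sharing a common link g = tanh but with different index directions.
d, T, eps = 3, 400, 0.2
thetas = [np.array([1.0, 0.5, -0.5]), np.array([-0.5, 1.0, 0.5])]
thetas = [t / np.linalg.norm(t) for t in thetas]
g = np.tanh

data = {a: {"s": [], "w": [], "y": []} for a in range(2)}
regret = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    preds = []
    for a in range(2):
        if len(data[a]["s"]) < 5:        # too little data: neutral prediction
            preds.append(0.0)
        else:
            preds.append(ipw_krr_predict(
                np.array(data[a]["s"]), np.array(data[a]["w"]),
                np.array(data[a]["y"]), np.array([x @ thetas[a]]))[0])
    greedy = int(np.argmax(preds))
    a_t = int(rng.integers(2)) if rng.random() < eps else greedy
    pi = eps / 2 + (1 - eps) * (a_t == greedy)   # realized propensity
    y_t = g(x @ thetas[a_t]) + 0.1 * rng.standard_normal()
    data[a_t]["s"].append(x @ thetas[a_t])
    data[a_t]["w"].append(1.0 / pi)
    data[a_t]["y"].append(y_t)
    regret += max(g(x @ th) for th in thetas) - g(x @ thetas[a_t])

print(regret / T)  # average per-round regret; shrinks as the fits improve
```

The inverse-propensity weights are what make the regression target unbiased for the arm's reward function despite the policy-dependent sampling; controlling the resulting variance inflation is one of the challenges the abstract highlights.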
Problem

Research questions and friction points this paper is trying to address.

contextual bandits
single-index model
adaptive sampling
nonparametric link function
inverse-propensity weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

kernel single-index model
adaptive sampling
inverse-propensity weighting
semiparametric bandits
asymptotic inference