Kernel Single-Index Bandits: Estimation, Inference, and Learning

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses contextual bandits with a single-index reward model, where each arm's reward depends on its context through an arm-specific index parameter and an unknown nonparametric link function. The authors propose a kernelized ε-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression to learn the reward functions. Under adaptive sampling, the method overcomes challenges arising from temporal dependence and variance inflation, establishing, for the first time, asymptotic normality of the single-index estimator and a central limit theorem for directional functionals in the reproducing kernel Hilbert space (RKHS). This unifies the frameworks of online learning and statistical inference. The theoretical analysis yields an optimal finite-time regret bound of Õ(√T) and constructs asymptotically valid confidence intervals.
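The Stein-based index estimation mentioned above can be illustrated under a Gaussian-design assumption: by the first-order Stein identity, if X ~ N(0, I) then E[Y·X] is proportional to the index direction θ, so a normalized sample moment recovers it. A minimal sketch (not the paper's estimator; all choices of link, dimension, and sample size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a single-index model y = g(x^T theta) + noise with Gaussian design.
d, n = 5, 20000
theta = np.array([3.0, 1.0, 0.0, -2.0, 0.5])
theta /= np.linalg.norm(theta)
X = rng.standard_normal((n, d))
y = np.tanh(X @ theta) + 0.1 * rng.standard_normal(n)

# First-order Stein identity: for X ~ N(0, I), E[y X] = E[g'(X^T theta)] * theta,
# so the sample moment (1/n) * sum_i y_i x_i points along theta.
theta_hat = (X.T @ y) / n
theta_hat /= np.linalg.norm(theta_hat)

# Alignment with the true direction; close to 1 when recovery succeeds.
print(abs(theta_hat @ theta))
```

The paper's contribution is to carry out this kind of estimation under adaptive sampling, where the covariate distribution depends on the bandit policy rather than being i.i.d. Gaussian as in this toy version.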

📝 Abstract
We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.
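The interaction loop described in the abstract can be sketched as follows: at each round the learner explores with probability ε or plays the greedy arm, records the realized propensity, and refits per-arm reward estimates by inverse-propensity-weighted kernel ridge regression on the projected index. This is a simplified sketch, not the paper's algorithm: the index directions are taken as known (the paper estimates them online), and the kernel, bandwidth, and regularization values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def ipw_krr_predict(s_train, w, y, s_query, lam=0.1, h=0.5):
    """Inverse-propensity-weighted kernel ridge regression (Gaussian kernel).
    Minimizes sum_i w_i (y_i - f(s_i))^2 + lam * ||f||_RKHS^2, where
    s_i = x_i^T theta is the scalar index and w_i = 1 / pi_i."""
    K = np.exp(-(s_train[:, None] - s_train[None, :]) ** 2 / (2 * h ** 2))
    W = np.diag(w)
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(s_train)), W @ y)
    Kq = np.exp(-(s_query[:, None] - s_train[None, :]) ** 2 / (2 * h ** 2))
    return Kq @ alpha

# Two arms sharing a common link g = tanh but with different index directions.
d, T, eps = 3, 400, 0.2
thetas = [np.array([1.0, 0.5, -0.5]), np.array([-0.5, 1.0, 0.5])]
thetas = [t / np.linalg.norm(t) for t in thetas]
g = np.tanh

data = {a: {"s": [], "w": [], "y": []} for a in range(2)}
regret = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    preds = []
    for a in range(2):
        if len(data[a]["s"]) < 5:        # too little data: neutral prediction
            preds.append(0.0)
        else:
            preds.append(ipw_krr_predict(
                np.array(data[a]["s"]), np.array(data[a]["w"]),
                np.array(data[a]["y"]), np.array([x @ thetas[a]]))[0])
    greedy = int(np.argmax(preds))
    a_t = int(rng.integers(2)) if rng.random() < eps else greedy
    pi = eps / 2 + (1 - eps) * (a_t == greedy)   # realized propensity
    y_t = g(x @ thetas[a_t]) + 0.1 * rng.standard_normal()
    data[a_t]["s"].append(x @ thetas[a_t])
    data[a_t]["w"].append(1.0 / pi)
    data[a_t]["y"].append(y_t)
    regret += max(g(x @ th) for th in thetas) - g(x @ thetas[a_t])

print(regret / T)  # average per-round regret; shrinks as the fits improve
```

The inverse-propensity weights are what make the regression target unbiased for the arm's reward function despite the policy-dependent sampling; controlling the resulting variance inflation is one of the challenges the abstract highlights.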
Problem

Research questions and friction points this paper is trying to address.

contextual bandits
single-index model
adaptive sampling
nonparametric link function
inverse-propensity weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

kernel single-index model
adaptive sampling
inverse-propensity weighting
semiparametric bandits
asymptotic inference