🤖 AI Summary
This paper studies the finite-armed semiparametric bandit problem, where the reward of each arm decomposes into a linear component plus an unknown, possibly adversarial offset term, unifying linear structure and nonlinear perturbations in a single model. The authors propose the first unified framework integrating orthogonalized regression, adaptive experimental design, and non-asymptotic statistical analysis. The method simultaneously achieves tight regret bounds, PAC guarantees, and optimal-arm identification. Under general conditions it attains the minimax-optimal $\tilde{O}(\sqrt{dT})$ regret, matching the lower bound for finite-armed linear bandits, and it further achieves logarithmic regret whenever a positive suboptimality gap exists. The approach is robust to adversarial offsets and computationally efficient, significantly extending the applicability of classical linear bandits beyond strict linearity assumptions.
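Concretely, the reward structure described above is typically written as follows (a standard formulation of the semiparametric bandit model; the notation here is illustrative, not necessarily the paper's own):

$$
y_t \;=\; \langle x_{a_t}, \theta^* \rangle \;+\; \nu_t \;+\; \eta_t,
$$

where $x_{a_t} \in \mathbb{R}^d$ is the feature vector of the arm chosen at round $t$, $\theta^* \in \mathbb{R}^d$ is the unknown linear parameter, $\nu_t$ is an offset shared by all arms at round $t$ (possibly chosen adversarially as a function of history, but independent of which arm is pulled), and $\eta_t$ is zero-mean noise. Setting $\nu_t \equiv 0$ recovers the classical finite-armed linear bandit.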
📝 Abstract
We study finite-armed semiparametric bandits, where each arm's reward combines a linear component with an unknown, potentially adversarial shift. This model strictly generalizes classical linear bandits and reflects complexities common in practice. We propose the first experimental-design approach that simultaneously offers a sharp regret bound, a PAC bound, and a best-arm identification guarantee. Our method attains the minimax regret $\tilde{O}(\sqrt{dT})$, matching the known lower bound for finite-armed linear bandits, and further achieves logarithmic regret under a positive suboptimality gap condition. These guarantees follow from our refined non-asymptotic analysis of orthogonalized regression that attains the optimal $\sqrt{d}$ rate, paving the way for robust and efficient learning across a broad class of semiparametric bandit problems.
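To build intuition for why orthogonalized regression is needed here, the following is a minimal NumPy sketch of the centering idea common in the semiparametric bandit literature: because the offset $\nu_t$ is fixed before the arm is sampled, regressing rewards on features *centered by the sampling distribution's mean* makes the offset uncorrelated with the regressors, so ordinary least squares on centered features recovers $\theta^*$. All names and the specific offset sequence below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 5, 50_000
theta = rng.normal(size=d)          # unknown linear parameter theta*
arms = rng.normal(size=(K, d))      # K fixed arm feature vectors in R^d
p = np.full(K, 1.0 / K)             # uniform sampling distribution over arms
mu = arms.T @ p                     # mean feature under the sampling distribution

X_centered, y = [], []
for t in range(T):
    nu = 1.0 + np.sin(t)            # shared, arm-independent offset (illustrative)
    a = rng.choice(K, p=p)          # arm drawn AFTER nu is fixed
    x = arms[a]
    X_centered.append(x - mu)       # center features: E[x - mu] = 0 under p
    y.append(x @ theta + nu + 0.1 * rng.normal())

X_centered, y = np.array(X_centered), np.array(y)

# Least squares on centered features: the offset is absorbed into the
# residual and is uncorrelated with (x - mu), so theta_hat is consistent.
theta_hat = np.linalg.lstsq(X_centered, y, rcond=None)[0]
print(np.linalg.norm(theta_hat - theta))   # small estimation error
```

Naive least squares on the raw (uncentered) features would instead absorb the mean offset into the estimate of $\theta$, biasing it; centering removes exactly that bias. The paper's contribution, per the abstract, is a sharper non-asymptotic analysis of this style of orthogonalized estimator combined with adaptive experimental design, which this toy sketch does not attempt to reproduce.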