🤖 AI Summary
This paper investigates the optimal regret bound for nonparametric contextual bandits under unbounded context distributions. Addressing the limitation of prior work, which is restricted to bounded-support contexts, it establishes, for the first time, a minimax lower bound on regret in the unbounded setting. The authors propose two nearest-neighbor UCB algorithms: a fixed-$k$ variant and an adaptive-$k$ variant. The latter achieves near-optimality by jointly balancing the bias-variance and exploration-exploitation trade-offs through a data-driven selection of $k$, under the Tsybakov margin condition and a parameterization of the tail behavior. It attains an expected regret bound of $\tilde{O}\big(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}} + T^{1-\beta}\big)$, matching the information-theoretic lower bound up to logarithmic factors and substantially extending existing results for bounded-support contexts.
📝 Abstract
The nonparametric contextual bandit is an important model of sequential decision-making problems. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for contexts with bounded support. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve the exploration-exploitation tradeoff and the bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest-neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that it achieves minimax-optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses an adaptive $k$. With a proper data-driven selection of $k$, it achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithmic factors, indicating that the second method is approximately optimal.
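To make the fixed-$k$ idea concrete, here is a minimal, hypothetical sketch of a nearest-neighbor UCB policy. The abstract does not give pseudocode, so everything here is an illustrative assumption: the function name `knn_ucb_bandit`, the simple $\sqrt{2\log T / k}$ exploration bonus (the paper's confidence widths would also depend on nearest-neighbor distances to handle the bias-variance tradeoff), and the Gaussian reward noise are all stand-ins, not the authors' actual algorithm.

```python
import numpy as np

def knn_ucb_bandit(contexts, reward_fns, k=5, n_arms=2, seed=0):
    """Illustrative fixed-k nearest-neighbor UCB policy (not the paper's exact method).

    At each round, for every arm, the mean reward is estimated by averaging
    the rewards observed at the k nearest past contexts where that arm was
    pulled, plus a UCB exploration bonus. The arm with the largest upper
    confidence bound is played.
    """
    rng = np.random.default_rng(seed)
    T = len(contexts)
    # Per-arm history of (context, reward) pairs.
    hist = {a: {"x": [], "r": []} for a in range(n_arms)}
    total_reward = 0.0
    for x in contexts:
        ucbs = []
        for a in range(n_arms):
            xs, rs = hist[a]["x"], hist[a]["r"]
            if len(xs) < k:
                # Not enough samples yet: force initial exploration.
                ucbs.append(np.inf)
                continue
            # Distances from the current context to this arm's past contexts.
            d = np.linalg.norm(np.asarray(xs) - x, axis=1)
            nn = np.argsort(d)[:k]
            mean = np.mean(np.asarray(rs)[nn])
            # Simplified UCB bonus; the paper's version is distance-dependent.
            bonus = np.sqrt(2.0 * np.log(T) / k)
            ucbs.append(mean + bonus)
        a = int(np.argmax(ucbs))
        # Noisy reward from the chosen arm (noise level is an assumption).
        r = reward_fns[a](x) + rng.normal(0.0, 0.1)
        hist[a]["x"].append(x)
        hist[a]["r"].append(r)
        total_reward += r
    return total_reward
```

On contexts drawn from an unbounded distribution (e.g. Gaussian), a fixed $k$ already lets the estimate adapt locally to the context density; the adaptive-$k$ variant described above would additionally tune $k$ per round to balance estimation bias against variance in low-density tail regions.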