🤖 AI Summary
This work addresses the challenge of achieving the rate-optimal $\tilde{O}(\sqrt{dT})$ regret bound in $K$-armed logistic bandits without relying on context diversity assumptions, such as a strictly positive minimum eigenvalue of the context covariance matrix. The authors propose SupSplitLog, an algorithm that splits the collected samples into two disjoint subsets: one is used to compute an initial-point estimator, and the other is used to apply a Newton-type one-step correction. To the authors' knowledge, this is the first logistic bandit algorithm to attain $\tilde{O}(\sqrt{dT})$ regret without any context diversity assumption, and it also improves the dependence on the dimension $d$ over existing algorithms. Moreover, SupSplitLog can be adapted to yield a regret bound that scales with a data-dependent complexity measure rather than directly with $d$, which is tighter when the context vectors are concentrated in a low-dimensional subspace. Theoretical analysis and numerical experiments both indicate that the algorithm outperforms existing methods.
📝 Abstract
We study the $K$-armed logistic bandit problem, where at each round the agent observes $K$ feature vectors, one for each of the $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators: one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on existing algorithms in terms of the dependence on the dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be easily adapted to obtain a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that numerically demonstrate the superiority of our algorithm and support the theoretical results.
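The split-then-correct idea described in the abstract can be illustrated with a minimal sketch. This is not the SupSplitLog algorithm itself: the abstract does not specify the splitting rule, the estimators, or the bandit loop, so the uniform random split, the ridge penalty `lam`, and the function names `fit_initial`, `one_step_correction`, and `split_and_correct` below are all assumptions made for illustration. The sketch only shows the generic pattern of fitting an initial logistic estimate on one subset and applying a single Newton step on the other, disjoint subset.

```python
import numpy as np


def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_initial(X, y, lam=1.0, iters=50):
    """Initial-point estimator: ridge-penalized logistic MLE via Newton iterations.

    `lam` and `iters` are illustrative choices, not values from the paper.
    """
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + lam * theta
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + lam * np.eye(d)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def one_step_correction(X, y, theta0, lam=1.0):
    """Single Newton-type step from theta0, using only the held-out subset.

    The ridge term keeps the curvature matrix invertible even when the rows
    of X lie in a low-dimensional subspace (an assumption of this sketch).
    """
    d = X.shape[1]
    p = sigmoid(X @ theta0)
    grad = X.T @ (p - y)  # gradient of the negative log-likelihood at theta0
    hess = (X * (p * (1.0 - p))[:, None]).T @ X + lam * np.eye(d)
    return theta0 - np.linalg.solve(hess, grad)


def split_and_correct(X, y, rng, frac_init=0.5, lam=1.0):
    """Split samples into two disjoint subsets, then estimate and correct.

    A uniform random split is assumed here; the paper's splitting rule is
    described as carefully designed and is not reproduced.
    """
    n = X.shape[0]
    perm = rng.permutation(n)
    n_init = int(frac_init * n)
    idx_init, idx_corr = perm[:n_init], perm[n_init:]
    theta0 = fit_initial(X[idx_init], y[idx_init], lam=lam)
    theta_hat = one_step_correction(X[idx_corr], y[idx_corr], theta0, lam=lam)
    return theta_hat, theta0


if __name__ == "__main__":
    # Synthetic data, purely for illustration.
    rng = np.random.default_rng(0)
    theta_true = np.array([1.0, -1.0, 0.5, 0.0, -0.5])
    X = rng.normal(size=(2000, 5))
    y = (rng.random(2000) < sigmoid(X @ theta_true)).astype(float)
    theta_hat, theta0 = split_and_correct(X, y, rng)
    print("initial estimate error:  ", np.linalg.norm(theta0 - theta_true))
    print("corrected estimate error:", np.linalg.norm(theta_hat - theta_true))
```

Because the correction step uses samples that played no part in fitting `theta0`, the initial estimate is independent of the data used in the Newton step; this independence is the usual motivation for sample splitting, and the sketch assumes that the same motivation applies here.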