Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving the optimal $\tilde{O}(\sqrt{dT})$ regret bound in $K$-armed logistic bandits without relying on context diversity assumptions, such as a strictly positive minimum eigenvalue of the context covariance matrix. The authors propose SupSplitLog, an algorithm that splits the collected samples into two disjoint subsets: one is used to compute an initial-point estimator, and the other to apply a Newton-type one-step correction. To the authors' knowledge, it is the first logistic bandit algorithm to attain $\tilde{O}(\sqrt{dT})$ regret without a context diversity assumption, and it also improves the dependence on the dimension $d$ in the regret upper bound. Moreover, SupSplitLog can be adapted to yield a regret bound that grows with a data-dependent complexity measure rather than directly with $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. Theoretical analysis and numerical experiments both support the algorithm's advantage over existing methods.
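To see why a strictly positive minimum eigenvalue is restrictive, the following sketch compares synthetic contexts confined to a low-dimensional subspace with full-rank contexts. It is an illustration only, not taken from the paper: the dimensions, the Gaussian data, and the helper name `min_eigenvalue_of_second_moment` are all assumptions made for this example.

```python
import numpy as np

def min_eigenvalue_of_second_moment(contexts):
    """Smallest eigenvalue of the empirical second-moment matrix (1/n) X^T X."""
    second_moment = contexts.T @ contexts / contexts.shape[0]
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    return float(np.linalg.eigvalsh(second_moment)[0])

rng = np.random.default_rng(0)
d, r, n = 5, 2, 1000  # ambient dimension, subspace dimension, sample count (all assumed)

# Contexts that live in an r-dimensional subspace of R^d
basis = rng.standard_normal((d, r))
low_rank_contexts = rng.standard_normal((n, r)) @ basis.T

# Contexts that spread over all d directions
full_rank_contexts = rng.standard_normal((n, d))

lam_low = min_eigenvalue_of_second_moment(low_rank_contexts)
lam_full = min_eigenvalue_of_second_moment(full_rank_contexts)
print(f"low-rank contexts:  lambda_min = {lam_low:.2e}")
print(f"full-rank contexts: lambda_min = {lam_full:.2e}")
```

In the low-rank case the smallest eigenvalue is zero up to floating-point error, so any assumption that bounds it away from zero excludes this kind of context process.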

📝 Abstract

We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be adapted simply to deduce a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that demonstrate numerically the superiority of our algorithm, validating the theoretical results.
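The split-then-correct idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SupSplitLog algorithm itself: the abstract does not specify the splitting rule, so an alternating even/odd split stands in for it, and the function names, the ridge term, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def initial_estimate(X, y, ridge=1.0, steps=25):
    """Initial-point estimator: ridge-regularized logistic MLE via Newton iterations."""
    dim = X.shape[1]
    theta = np.zeros(dim)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + ridge * theta
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(dim)
        theta = theta - np.linalg.solve(hess, grad)
    return theta

def one_step_correction(theta0, X, y, ridge=1.0):
    """One Newton-type step from theta0, using samples not seen by theta0.

    The ridge term is added to the Hessian only to keep it invertible
    (an assumption of this sketch).
    """
    dim = X.shape[1]
    p = sigmoid(X @ theta0)
    grad = X.T @ (p - y)  # gradient of the logistic loss at theta0
    hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(dim)
    return theta0 - np.linalg.solve(hess, grad)

# Synthetic logistic data (assumed for illustration)
rng = np.random.default_rng(0)
n, d = 2000, 3
theta_star = np.array([1.0, -0.5, 0.25])
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigmoid(X @ theta_star)).astype(float)

# Placeholder splitting rule: alternate samples between the two disjoint subsets
idx_init = np.arange(0, n, 2)
idx_corr = np.arange(1, n, 2)

theta0 = initial_estimate(X[idx_init], y[idx_init])
theta_hat = one_step_correction(theta0, X[idx_corr], y[idx_corr])
print("initial estimate:  ", theta0)
print("one-step corrected:", theta_hat)
```

Keeping the two subsets disjoint means the initial point is independent of the samples used in the correction step, which is presumably what makes the one-step analysis tractable; the paper's actual splitting rule is designed to balance the accuracy needs of the two stages and is not reproduced here.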

Innovation

Methods, ideas, or system contributions that make the work stand out.

logistic bandits

regret minimization

context diversity

sample splitting

Newton-type correction

🔎 Similar Papers

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

2024-05-16 · arXiv.org · Citations: 2
