🤖 AI Summary
This study examines why efficient learning is possible in many-armed Bayesian Bernoulli bandits without explicitly modeling epistemic uncertainty, motivated by the KL-regularized policy updates used in RLVR methods such as GRPO. The authors analyze an annealed softmax (Boltzmann) greedy policy that samples actions according to a softmax of empirical mean rewards. Under a linear upper-tail condition on the prior (the \( \beta = 1 \) case of \( \beta \)-regularity), the policy attains Bayes regret \( \tilde{O}(m + T/m) \); when the number of arms is \( m = \Theta(\sqrt{T}) \), this becomes \( \tilde{O}(\sqrt{T}) \), the near-optimal rate in this regime. The analysis also shows that with a small number of arms, the same kind of softmax policy can suffer linear regret. The work thus offers a stylized theoretical explanation for why uncertainty-agnostic updates can be effective in Bayesian bandit settings.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. Under a linear upper-tail condition on the prior (the $β=1$ case of $β$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\tilde{O}(m + T/m)$, and in particular $\tilde{O}(\sqrt{T})$ when the number of arms scales as $m = Θ(\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $β$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $β$-regularity.
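To make the setting concrete, the following is a minimal simulation sketch of a softmax policy over empirical means in a Bayesian Bernoulli bandit. It is not the authors' algorithm or schedule. The Uniform(0, 1) prior (one example of a prior with a linear upper tail), the inverse-temperature schedule `sqrt(t)`, the default empirical mean of 0.5 for unpulled arms, and the names `softmax_probs` and `run_annealed_softmax` are all assumptions chosen for illustration.

```python
import math
import random


def softmax_probs(means, inv_temp):
    """Softmax over empirical means; subtracting the max keeps exp() stable."""
    top = max(means)
    weights = [math.exp(inv_temp * (v - top)) for v in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_annealed_softmax(num_arms, horizon, seed=0):
    """Simulate one bandit instance and return its cumulative pseudo-regret.

    Assumptions (not taken from the source): arm means are i.i.d.
    Uniform(0, 1), unpulled arms get empirical mean 0.5, and the inverse
    temperature at round t is sqrt(t).
    """
    rng = random.Random(seed)
    true_means = [rng.random() for _ in range(num_arms)]
    best = max(true_means)
    pulls = [0] * num_arms
    successes = [0] * num_arms
    regret = 0.0
    for t in range(1, horizon + 1):
        # Empirical mean per arm; no confidence bonus or posterior is tracked.
        emp = [successes[a] / pulls[a] if pulls[a] else 0.5
               for a in range(num_arms)]
        # Hypothetical annealing: the policy sharpens as t grows.
        probs = softmax_probs(emp, inv_temp=math.sqrt(t))
        arm = rng.choices(range(num_arms), weights=probs, k=1)[0]
        reward = 1 if rng.random() < true_means[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        regret += best - true_means[arm]
    return regret


if __name__ == "__main__":
    T = 900
    for m in (2, 30):  # 30 = sqrt(T), the many-armed regime in the abstract
        avg = sum(run_annealed_softmax(m, T, seed=s) for s in range(10)) / 10
        print(f"m={m:3d}  average regret over 10 seeds: {avg:.1f}")
```

The script compares a two-armed instance with one where the number of arms is \( \sqrt{T} \). Its output depends on the assumed prior and schedule, so it illustrates the setup only and does not reproduce the paper's bounds.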