🤖 AI Summary
This paper addresses the risk-sensitive multi-armed bandit problem and identifies a fundamental limitation of the conventional single-arm optimality assumption under generalized distortion riskmetrics: for a wide range of risk measures, the optimal policy requires mixing across multiple arms. To address this, we formally define and prove arm-mixing optimality, the first such result in the literature. We propose an adaptive algorithm capable of uniformly tracking either mixed or pure optimal policies. Leveraging an asymptotically optimal sampling design and a risk-aware regret analysis, we establish a regret bound of $O((\log T/T)^{\nu})$ for some $\nu > 0$, substantially improving upon existing rates. Our core contributions are threefold: (i) breaking the single-arm paradigm; (ii) establishing a general, unified framework for risk-sensitive bandits; and (iii) achieving theoretically optimal convergence speed under broad distortion riskmetrics.
📄 Abstract
This paper introduces a general framework for risk-sensitive bandits that integrates risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that, for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles of designing bandit algorithms, since there are uncountably many possible mixtures. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) identifies standard risk-sensitive bandit models for which solitary arm selection is not optimal, and (iii) designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when a mixture is optimal) or solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales as $O((\log T/T)^{\nu})$, where $T$ is the horizon and $\nu > 0$ is a riskmetric-specific constant.
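To see why a mixture of arms can be strictly better than every solitary arm, here is a minimal numerical sketch, not taken from the paper: it uses one illustrative distortion riskmetric, $\rho_g(X) = \int_0^\infty g(S_X(x))\,dx$ with the concave distortion $g(s) = \sqrt{s}$, and two hypothetical arms (arm A pays 1 deterministically; arm B pays 4 with probability 1/16, else 0) chosen so that both pure arms score exactly 1, while an interior mixture scores strictly more. All arm parameters and the choice of $g$ are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch (hypothetical arms, not the paper's algorithm).
# Distortion riskmetric: rho_g(X) = integral_0^inf g(S(x)) dx, g(s) = sqrt(s).
# Arm A: reward 1 with prob 1.  Arm B: reward 4 with prob 1/16, else 0.
# Mixture playing A with prob p has survival function
#   S_p(x) = p + (1-p)/16  on [0, 1),   S_p(x) = (1-p)/16  on [1, 4),
# so the integral reduces to the closed form below.

def rho(p):
    """Riskmetric of the mixture that plays arm A w.p. p, arm B w.p. 1-p."""
    return np.sqrt(p + (1 - p) / 16) + 3 * np.sqrt((1 - p) / 16)

grid = np.linspace(0.0, 1.0, 1001)   # candidate mixture weights
vals = rho(grid)
best_p = grid[np.argmax(vals)]

# Both pure arms (p=0 and p=1) give rho = 1.0, yet the best mixture
# (around p = 0.6) gives roughly 1.265 -- no solitary arm is optimal.
print(rho(0.0), rho(1.0), best_p, vals.max())
```

This mirrors the abstract's point: because the distortion is applied to the mixture's survival function rather than averaged across arms, a concave $g$ rewards interleaving arms, and the optimizer must search over a continuum of mixture weights rather than a finite arm set.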