Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms

๐Ÿ“… 2025-03-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses the risk-sensitive multi-armed bandit problem and identifies a fundamental limitation of the conventional single-arm optimality assumption under generalized distortion riskmetrics: for many risk measures, the optimal policy requires mixing across multiple arms. To address this, the paper formally defines and proves arm-mixture optimality, the first such result in the literature, and proposes an adaptive algorithm that uniformly tracks either mixed or pure optimal policies. Leveraging an asymptotically optimal sampling design and a risk-aware regret analysis, it establishes a regret bound of $O((\log T/T)^{\nu})$ for some $\nu > 0$, improving upon existing rates. The core contributions are threefold: (i) breaking the single-arm paradigm; (ii) establishing a general, unified framework for risk-sensitive bandits; and (iii) achieving theoretically optimal convergence speed under a broad class of distortion riskmetrics.

๐Ÿ“ Abstract
This paper introduces a general framework for risk-sensitive bandits that integrates risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for a wide range of riskmetrics, the optimal bandit policy involves selecting a mixture of arms. This is in sharp contrast to the convention in multi-armed bandit algorithms that there is generally a solitary arm that maximizes the utility, whether purely reward-centric or risk-sensitive. This creates a major departure from the principles for designing bandit algorithms, since there are uncountably many mixture possibilities. The contributions of the paper are as follows: (i) it formalizes a general framework for risk-sensitive bandits, (ii) identifies standard risk-sensitive bandit models for which solitary arm selections are not optimal, and (iii) designs regret-efficient algorithms whose sampling strategies can accurately track optimal arm mixtures (when a mixture is optimal) or solitary arms (when a solitary arm is optimal). The algorithms are shown to achieve a regret that scales as $O((\log T/T)^{\nu})$, where $T$ is the horizon and $\nu > 0$ is a riskmetric-specific constant.
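The mixture-optimality phenomenon can be illustrated with a small numerical sketch (the arms, the concave distortion $g(p)=\sqrt{p}$, and all numbers below are illustrative choices, not taken from the paper). A distortion riskmetric evaluates a nonnegative reward distribution as $\rho_g(F) = \int_0^\infty g(P(X > t))\,dt$; because $g$ is nonlinear, $\rho_g$ of a mixture of two arms' distributions can strictly exceed $\rho_g$ of either pure arm:

```python
import math

def distorted_value(atoms, g):
    """rho_g(F) = integral_0^inf g(P(X > t)) dt for a discrete
    nonnegative reward distribution given as [(value, prob), ...]."""
    atoms = sorted(atoms)
    rho, prev, tail = 0.0, 0.0, sum(p for _, p in atoms)
    for v, p in atoms:
        rho += (v - prev) * g(tail)  # survival P(X > t) is constant on [prev, v)
        prev, tail = v, tail - p
    return rho

g = math.sqrt  # concave distortion (illustrative choice)

arm_a = [(1.2, 1.0)]                # safe arm: deterministic reward 1.2
arm_b = [(0.0, 0.9), (10.0, 0.1)]   # risky arm: reward 10 w.p. 0.1

rho_a = distorted_value(arm_a, g)   # = 1.2
rho_b = distorted_value(arm_b, g)   # = 10 * sqrt(0.1) ~ 3.162

def rho_mix(alpha):
    """Value of the mixed policy: play arm A w.p. alpha, arm B otherwise."""
    mix = ([(v, alpha * p) for v, p in arm_a]
           + [(v, (1 - alpha) * p) for v, p in arm_b])
    return distorted_value(mix, g)

best_alpha = max((i / 1000 for i in range(1001)), key=rho_mix)
# A strict mixture beats both pure arms:
assert 0 < best_alpha < 1
assert rho_mix(best_alpha) > max(rho_a, rho_b)
```

The grid search lands on a small but strictly positive weight on the safe arm, which raises the distorted value above either pure arm's value. This is exactly the situation in which tracking a single "best arm," as conventional bandit algorithms do, is suboptimal.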
Problem

Research questions and friction points this paper is trying to address.

Introduces a framework for risk-sensitive bandits using distortion riskmetrics.
Identifies conditions where optimal policies require arm mixtures, not single arms.
Designs regret-efficient algorithms for tracking optimal arm mixtures or single arms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates distortion riskmetrics for risk-sensitive objectives
Identifies optimal bandit policies involving arm mixtures
Designs regret-efficient algorithms tracking optimal arm mixtures
๐Ÿ”Ž Similar Papers
No similar papers found.