🤖 AI Summary
Reinforcement learning (RL) agents must often make risk-sensitive decisions under epistemic uncertainty, i.e., uncertainty arising from limited knowledge of the model parameters. This uncertainty can lead to overconfident or unsafe policies, especially in low-data regimes.
Method: We propose a Bayesian risk-averse framework grounded in the Bayesian Risk Markov Decision Process (BRMDP), which explicitly models parameter uncertainty through priors and posteriors. Posterior sampling enables a principled online exploration–exploitation trade-off, and we establish, for the first time, that the Bayesian risk value function pessimistically underestimates the true value function.
Contribution/Results: Leveraging this insight, we design an adaptive risk-averse algorithm applicable to both general RL and contextual multi-armed bandits (CMAB). We introduce and analyze the Bayesian risk regret and prove a sub-linear regret bound. Theoretical analysis and empirical evaluation confirm that the method effectively mitigates epistemic uncertainty when data are limited, while the induced risk bias vanishes asymptotically as data accumulate, ensuring both theoretical soundness and practical adaptability.
📝 Abstract
In this paper, we study the Bayesian risk-averse formulation in reinforcement learning (RL). To address the epistemic uncertainty due to a lack of data, we adopt the Bayesian Risk Markov Decision Process (BRMDP) to account for the parameter uncertainty of the unknown underlying model. We derive an asymptotic normality result that characterizes the difference between the Bayesian risk value function and the original value function under the true unknown distribution. The results indicate that the Bayesian risk-averse approach tends to pessimistically underestimate the original value function; this discrepancy increases with stronger risk aversion and decreases as more data become available. We then exploit this adaptive property in the setting of online RL as well as online contextual multi-armed bandits (CMAB), a special case of online RL. We provide two posterior-sampling procedures, one for the general RL problem and one for the CMAB problem. We establish sub-linear regret bounds under the conventional notion of regret for both the RL and CMAB settings. Additionally, we establish a sub-linear regret bound for the CMAB setting under the regret defined as the Bayesian risk regret. Finally, we conduct numerical experiments to demonstrate the effectiveness of the proposed algorithms in addressing epistemic uncertainty and to verify the theoretical properties.
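The underestimation property described above can be made concrete in a toy conjugate-Gaussian example. The sketch below is our own illustration, not the paper's general construction: all specifics (a N(0, 100) prior on an unknown mean, CVaR as the risk functional, and noise-free observations fixed at the true mean to isolate the risk penalty) are assumptions chosen for clarity. It computes the Bayesian risk value, i.e. the lower-tail CVaR of the posterior over the mean, and shows that the bias relative to the true value is positive (pessimistic) and shrinks as the sample size grows.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_cvar(mean, std, alpha=0.2):
    """Closed-form lower-tail CVaR_alpha of a N(mean, std^2) variable:
    mean - std * phi(z_alpha) / alpha, with z_alpha the alpha-quantile
    of the standard normal."""
    z = NormalDist().inv_cdf(alpha)
    return mean - std * NormalDist().pdf(z) / alpha

def bayes_risk_value(n, true_mean=1.0, obs_var=1.0, prior_var=100.0, alpha=0.2):
    """Bayesian risk value of a single arm after n observations, under a
    conjugate N(0, prior_var) prior on the unknown mean.  For a noise-free
    illustration, every observation is fixed at the true mean, so the only
    remaining uncertainty is the posterior spread."""
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)   # posterior variance
    post_mean = post_var * (n * true_mean / obs_var)   # posterior mean
    # Risk-averse value: CVaR over the posterior of the mean.
    return gaussian_cvar(post_mean, sqrt(post_var), alpha)

# Pessimism bias = true value - Bayesian risk value, for growing sample sizes.
biases = [1.0 - bayes_risk_value(n) for n in (10, 100, 1000)]
```

With these assumptions the bias is roughly 0.44, 0.14, and 0.044 for n = 10, 100, 1000: always positive (the risk-averse value sits below the true value) and vanishing as the posterior concentrates, mirroring the adaptive behavior the abstract describes.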