🤖 AI Summary
This work investigates the global convergence of policy gradient (PG) and natural policy gradient (NPG) methods for risk-sensitive reinforcement learning under Expected Conditional Risk Measures (ECRMs), a class of dynamic time-consistent risk measures. We develop a unified PG/NPG algorithmic framework covering four practical settings: constrained direct parameterization, log-barrier regularized softmax, entropy-regularized softmax, and approximate NPG with inexact policy evaluation. We establish, for the first time, rigorous global optimality guarantees and iteration complexities for ECRM-based risk optimization ($O(1/\varepsilon^2)$ for PG and $O(1/\varepsilon)$ for NPG), thereby filling a critical theoretical gap in globally convergent risk-sensitive RL. Empirical evaluation on a stochastic Cliffwalk environment demonstrates that the proposed algorithms effectively mitigate risk while maintaining stability and convergence.
📝 Abstract
Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case \citep{mei2020global,agarwal2021theory,cen2022fast,bhandari2024global}. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRM-based RL problems. We provide global optimality guarantees and iteration complexities for the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm \citep{williams1992simple} and a risk-averse NPG algorithm \citep{kakade2001natural} on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.
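To make setting (iii) concrete, the following is a minimal sketch of an NPG update with softmax parameterization and entropy regularization in the risk-neutral, reward-maximizing tabular case, using the standard closed-form multiplicative update for that setting. It is not the ECRM-based algorithm of this paper: the toy MDP (`P`, `r`), the helper names `soft_q` and `npg_step`, and all hyperparameter values are assumptions chosen purely for illustration.

```python
import numpy as np

def soft_q(P, r, pi, gamma, tau):
    """Exact entropy-regularized Q-values of a tabular policy (risk-neutral)."""
    n_states = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
    r_pi = np.sum(pi * r, axis=1)                # expected one-step reward under pi
    entropy = -np.sum(pi * np.log(pi), axis=1)   # per-state policy entropy
    # The soft value solves V = r_pi + tau * entropy + gamma * P_pi @ V.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi + tau * entropy)
    return r + gamma * P @ V                     # shape (n_states, n_actions)

def npg_step(pi, Q, eta, gamma, tau):
    """One entropy-regularized NPG step for a tabular softmax policy:
    pi_new(a|s) is proportional to pi(a|s)^(1 - eta*tau/(1-gamma)) * exp(eta*Q(s,a)/(1-gamma)).
    """
    logits = (1.0 - eta * tau / (1.0 - gamma)) * np.log(pi) + eta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Hypothetical toy MDP: random transitions and rewards (illustration only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, tau = 3, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
r = rng.uniform(size=(n_states, n_actions))

eta = 0.5 * (1.0 - gamma) / tau                  # step size within (0, (1 - gamma) / tau]
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
for _ in range(1000):
    pi = npg_step(pi, soft_q(P, r, pi, gamma, tau), eta, gamma, tau)
```

At a fixed point of this update the policy is proportional to `exp(Q / tau)`, the optimality condition of the entropy-regularized problem. Setting `eta = (1 - gamma) / tau` removes the dependence on the previous policy, so the step reduces to soft policy iteration.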