🤖 AI Summary
To address the lack of convergence guarantees and unified risk modeling in policy gradient methods for risk-sensitive reinforcement learning, this paper proposes the first provably convergent risk-sensitive distributional policy gradient framework. Methodologically, we (1) derive the first analytical expression for the gradient of the return distribution with respect to the policy parameters; (2) design a novel algorithm, Categorical Distributional Policy Gradient (CDPG), that achieves both finite-support optimality and finite-iteration convergence; and (3) ensure compatibility with a broad class of coherent risk measures. The theoretical analysis draws on tools from stochastic optimization to establish convergence and risk-sensitivity properties. Empirical evaluation on stochastic Cliffwalk and CartPole benchmarks demonstrates clear improvements in robustness, reliability, and risk mitigation over existing approaches.
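One appeal of the distributional view is that coherent risk measures such as conditional value-at-risk (CVaR) can be evaluated directly from a return distribution on a fixed finite support. The sketch below is an illustrative example of that idea only, not the paper's method; the function name and interface are assumptions for this example:

```python
import numpy as np

def cvar_from_categorical(atoms, probs, alpha):
    """CVaR_alpha of a categorical cost distribution: the expected cost
    within the worst alpha-fraction of outcomes (higher cost = worse)."""
    order = np.argsort(atoms)[::-1]       # sort atoms from worst cost down
    a, p = atoms[order], probs[order]
    cum = np.cumsum(p)
    # Portion of each atom's mass that falls inside the alpha-tail.
    tail = np.clip(alpha - (cum - p), 0.0, p)
    return float(np.dot(tail, a) / alpha)
```

With `alpha = 1.0` this recovers the plain expectation, while small `alpha` focuses the objective on the worst-case tail, which is the sense in which a single distributional estimate unifies risk-neutral and risk-sensitive criteria.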
📝 Abstract
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods learn a point estimate of the random cumulative cost, distributional RL (DRL) estimates its entire distribution, yielding a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex, as it requires differentiating a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, providing an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on a set of fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of the risk-sensitive setting in DRL.
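To make the categorical approximation idea concrete, here is a minimal sketch in the style of C51-type categorical DRL: a one-step update shifts and scales a categorical cost distribution, then projects it back onto the fixed support. This is a generic illustration under assumed names (`project_categorical`, a scalar `cost`, discount `gamma`), not the paper's CDPG algorithm:

```python
import numpy as np

def project_categorical(atoms, probs, cost, gamma):
    """Project the distribution of cost + gamma * Z, where Z is categorical
    on `atoms` with probabilities `probs`, back onto the fixed support."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    # Shifted/scaled atom locations, clipped to the fixed support range.
    tz = np.clip(cost + gamma * atoms, v_min, v_max)
    # Fractional index of each shifted atom on the fixed grid.
    b = (tz - v_min) / dz
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    # Split each atom's mass between its two neighbouring support points.
    for j in range(len(atoms)):
        if lower[j] == upper[j]:          # landed exactly on a grid point
            out[lower[j]] += probs[j]
        else:
            out[lower[j]] += probs[j] * (upper[j] - b[j])
            out[upper[j]] += probs[j] * (b[j] - lower[j])
    return out
```

Restricting all distributions to one fixed grid in this way is what makes a finite-support analysis tractable: the policy-evaluation and gradient objects live in a finite-dimensional simplex rather than a general space of probability measures.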