🤖 AI Summary
To address the lack of convergence guarantees and unified risk modeling in policy gradient methods for risk-sensitive reinforcement learning, this paper proposes the first provably convergent risk-sensitive distributional policy gradient framework. Methodologically, we (1) derive the first analytical expression for the gradient of the return distribution with respect to the policy parameters; (2) design a novel algorithm, Categorical Distributional Policy Gradient (CDPG), that achieves both finite-support optimality and finite-iteration convergence; and (3) ensure compatibility with a broad class of coherent risk measures. The theoretical analysis draws on tools from stochastic optimization to establish convergence and risk-sensitivity properties. Empirical evaluation on stochastic Cliffwalk and CartPole benchmarks demonstrates clear improvements in robustness, reliability, and risk mitigation over existing approaches.
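One appeal of the distributional view is that coherent risk measures such as conditional value-at-risk (CVaR) can be evaluated directly from a return distribution on a fixed finite support. The sketch below is an illustrative example of that idea only, not the paper's method; the function name and interface are assumptions for this example:

```python
import numpy as np

def cvar_from_categorical(atoms, probs, alpha):
    """CVaR_alpha of a categorical cost distribution: the expected cost
    within the worst alpha-fraction of outcomes (higher cost = worse)."""
    order = np.argsort(atoms)[::-1]       # sort atoms from worst cost down
    a, p = atoms[order], probs[order]
    cum = np.cumsum(p)
    # Portion of each atom's mass that falls inside the alpha-tail.
    tail = np.clip(alpha - (cum - p), 0.0, p)
    return float(np.dot(tail, a) / alpha)
```

With `alpha = 1.0` this recovers the plain expectation, while small `alpha` focuses the objective on the worst-case tail, which is the sense in which a single distributional estimate unifies risk-neutral and risk-sensitive criteria.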
📝 Abstract
Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods learn a point estimate of the random cumulative cost, distributional RL (DRL) estimates its entire distribution, yielding a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex, as it requires differentiating a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, providing an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on a set of fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of the risk-sensitive setting in DRL.
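To make the categorical approximation idea concrete, here is a minimal sketch in the style of C51-type categorical DRL: a one-step update shifts and scales a categorical cost distribution, then projects it back onto the fixed support. This is a generic illustration under assumed names (`project_categorical`, a scalar `cost`, discount `gamma`), not the paper's CDPG algorithm:

```python
import numpy as np

def project_categorical(atoms, probs, cost, gamma):
    """Project the distribution of cost + gamma * Z, where Z is categorical
    on `atoms` with probabilities `probs`, back onto the fixed support."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    # Shifted/scaled atom locations, clipped to the fixed support range.
    tz = np.clip(cost + gamma * atoms, v_min, v_max)
    # Fractional index of each shifted atom on the fixed grid.
    b = (tz - v_min) / dz
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    out = np.zeros_like(probs)
    # Split each atom's mass between its two neighbouring support points.
    for j in range(len(atoms)):
        if lower[j] == upper[j]:          # landed exactly on a grid point
            out[lower[j]] += probs[j]
        else:
            out[lower[j]] += probs[j] * (upper[j] - b[j])
            out[upper[j]] += probs[j] * (b[j] - lower[j])
    return out
```

Restricting all distributions to one fixed grid in this way is what makes a finite-support analysis tractable: the policy-evaluation and gradient objects live in a finite-dimensional simplex rather than a general space of probability measures.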