🤖 AI Summary
This paper studies the sample complexity of Iterated Conditional Value-at-Risk (CVaR) optimization in risk-sensitive reinforcement learning with a generative model, characterizing the number of samples required to obtain an ε-optimal policy. It establishes, for the first time, a rigorous equivalence between Iterated CVaR RL and (s,a)-rectangular distributionally robust RL with a CVaR-specific uncertainty set. Building on this connection, it proposes ICVaR-VI, a value-iteration-based algorithm, and complements its upper bound with a minimax lower-bound construction. The analysis reveals the dependence of sample complexity on the risk tolerance τ, discount factor γ, and accuracy ε: Õ(SA/((1−γ)⁴τ²ε²)) in general, improving to Õ(SA/((1−γ)³ε²)) when τ ≥ γ; the upper and lower bounds match for any constant risk level τ. In the worst-path RL limit (τ → 0), the paper derives a tight bound of Õ(SA/pₘᵢₙ), where pₘᵢₙ is the minimum non-zero transition probability.
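The equivalence to robust RL rests on the standard dual representation of lower-tail CVaR: at level τ, the CVaR of the next-state value equals the worst-case expectation over all transition distributions whose likelihood ratio against the nominal kernel is at most 1/τ. The snippet below is our own minimal illustration of this known identity, not code from the paper, and all helper names are ours; it checks the two sides numerically with a small linear program.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
tau = 0.3
n = 5
v = rng.normal(size=n)            # next-state values V(s')
p = rng.dirichlet(np.ones(n))     # nominal transition P(.|s, a)

# Lower-tail CVaR_tau: mean of the worst tau-fraction of outcomes.
order = np.argsort(v)             # ascending: worst outcomes first
ps = p[order]
cum = np.cumsum(ps)
# probability mass each atom contributes to the tau-tail
w = np.clip(np.minimum(cum, tau) - (cum - ps), 0.0, None)
cvar = (w @ v[order]) / tau

# Robust view: minimize E_{P'}[V] over the CVaR uncertainty set
#   U_tau(P) = { P' : 0 <= P'(s') <= P(s')/tau, sum_s' P'(s') = 1 }.
res = linprog(c=v, A_eq=np.ones((1, n)), b_eq=[1.0],
              bounds=[(0.0, pi / tau) for pi in p])
assert np.isclose(cvar, res.fun)
print(f"CVaR_tau = {cvar:.6f} = robust inf = {res.fun:.6f}")
```

Applying this per state-action pair yields exactly the kind of (s,a)-rectangular uncertainty set the summary refers to.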
📝 Abstract
In this work, we study the sample complexity of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, a criterion named Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributionally robust RL with a specific uncertainty set for CVaR. We then develop nearly matching upper and lower bounds on the sample complexity for this problem. Specifically, we prove that a value-iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor and $S, A$ are the sizes of the state and action spaces. Furthermore, if $\tau \geq \gamma$, the sample complexity improves to $\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$. We also show a minimax lower bound of $\tilde{\Omega}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For any constant risk level $0 < \tau \leq 1$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. Finally, we investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward, and develop matching upper and lower bounds of $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
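For concreteness, here is a minimal sketch of value iteration with a CVaR Bellman backup on a tabular MDP with a known kernel. This is an idealized illustration under our own naming, not the paper's ICVaR-VI, which runs such a backup on empirical transition estimates drawn from the generative model.

```python
import numpy as np

def cvar_lower(values, probs, tau):
    """Lower-tail CVaR_tau: mean of the worst tau-fraction of outcomes.
    tau = 1 recovers the expectation; tau -> 0 approaches the minimum."""
    order = np.argsort(values)            # ascending: worst outcomes first
    v, p = values[order], probs[order]
    cum = np.cumsum(p)
    # probability mass each atom contributes to the tau-tail
    w = np.clip(np.minimum(cum, tau) - (cum - p), 0.0, None)
    return float(w @ v) / tau

def icvar_vi(P, r, gamma, tau, n_iter=1000):
    """Value iteration with a CVaR backup on a known tabular MDP.
    P: (S, A, S) transition kernel, r: (S, A) reward table."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # expectation in the Bellman backup replaced by CVaR_tau
                Q[s, a] = r[s, a] + gamma * cvar_lower(V, P[s, a], tau)
        V = Q.max(axis=1)                 # greedy improvement
    return V, Q.argmax(axis=1)            # value and greedy policy
```

Setting $\tau = 1$ reduces the backup to the standard Bellman expectation, consistent with the $\tau \geq \gamma$ regime recovering the classical $\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$ generative-model rate, while $\tau \to 0$ turns it into the worst-path backup over reachable successors.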