🤖 AI Summary
Existing Conditional Value-at-Risk policy gradient (CVaR-PG) methods achieve risk-sensitive optimization by training only on the worst-performing trajectories and discarding the rest, resulting in severe sample inefficiency. This work proposes **Return Capping**, a reformulation that imposes an upper bound on the total return of trajectories used in training instead of discarding them, and shows this is equivalent to the original CVaR objective when the cap is set appropriately. Unlike prior approaches, Return Capping shifts CVaR-PG from *discarding* to *reusing* trajectories, substantially improving data utilization. Empirically, the method yields consistently improved performance over baselines across a number of benchmark environments while retaining the original CVaR optimization goal.
📝 Abstract
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines.
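To make the contrast concrete, the following is a minimal sketch (not the paper's implementation) of how the two schemes weight trajectories in a policy-gradient update. The function names and the use of the empirical quantile as the VaR threshold are illustrative assumptions; the key difference is that standard CVaR-PG zeroes out all trajectories above the threshold, while capping clips their returns so every trajectory still contributes.

```python
import numpy as np

def cvar_pg_weights(returns, alpha):
    """Standard CVaR-PG weighting (sketch): only the worst alpha-fraction
    of trajectories get a nonzero weight; the rest are discarded."""
    var = np.quantile(returns, alpha)  # empirical value-at-risk threshold
    # Trajectories above the threshold contribute nothing to the gradient.
    return np.where(returns <= var, returns - var, 0.0)

def capped_weights(returns, cap):
    """Return-capping weighting (sketch): every trajectory is kept, but its
    return is clipped at the cap, so high-return data is reused rather
    than thrown away."""
    return np.minimum(returns, cap)
```

With a batch of returns `[1, 2, 3, 4]` and `alpha = 0.5`, the standard scheme assigns zero weight to the two best trajectories, whereas capping at the same threshold keeps all four in the update.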