🤖 AI Summary
Multi-objective online learning and risk-sensitive optimization in average-reward Markov decision processes (MDPs) remain underexplored. Method: This paper proposes the Reward-Extended Differential (RED) reinforcement learning framework, which enables fully online Conditional Value-at-Risk (CVaR) optimization under the average-reward criterion, without resorting to bilevel optimization or state augmentation, by uncovering and exploiting an intrinsic structural property of average-reward MDPs. The framework comprises RED prediction and control algorithms, a subtask decoupling mechanism, and a risk-aware policy update, with convergence guarantees in the tabular case. Results: On standard benchmarks, RED improves multi-objective learning efficiency and achieves, for the first time, fully online, provably convergent CVaR optimization in average-reward MDPs.
📝 Abstract
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL), where most effort has focused on discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and use it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve multiple learning objectives, or subtasks, simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including provably convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully online manner, without an explicit bi-level optimization scheme or an augmented state space.
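For context, the "differential" machinery the abstract refers to builds on average-reward (differential) TD learning, in which the average-reward estimate is itself learned online from the same TD error that drives the value updates; the RED framework generalizes this idea so that other scalar subtask estimates (such as CVaR's value-at-risk threshold) can be learned in the same fashion. Below is a minimal sketch of the *base* differential TD(0) prediction update on a hypothetical two-state Markov reward process — the function names, the MRP, and the step sizes are all illustrative assumptions, and this is not the paper's RED algorithm itself:

```python
# Hypothetical two-state MRP: states alternate 0 -> 1 -> 0 -> ...,
# with reward 0 when leaving state 0 and reward 2 when leaving
# state 1, so the true average reward is 1.
def step(s):
    return (0.0, 1) if s == 0 else (2.0, 0)

def differential_td(step_fn, n_states=2, alpha=0.05, eta=0.5, steps=20000):
    """Tabular differential TD(0) prediction for an average-reward MRP.

    Learns differential value estimates v alongside a running
    average-reward estimate r_bar. Note that r_bar is updated from the
    same TD error as v -- this is the pattern that a reward-extended
    framework like RED generalizes to other scalar subtasks.
    """
    v = [0.0] * n_states   # differential (relative) value estimates
    r_bar = 0.0            # average-reward estimate, learned online
    s = 0
    for _ in range(steps):
        r, s_next = step_fn(s)
        delta = r - r_bar + v[s_next] - v[s]   # differential TD error
        v[s] += alpha * delta
        r_bar += eta * alpha * delta           # subtask-style update
        s = s_next
    return v, r_bar
```

On this deterministic MRP, r_bar converges to the true average reward of 1, and the value difference v[1] - v[0] converges to 1, which makes both TD errors vanish at the fixed point.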