Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes

📅 2024-10-14
🏛️ arXiv.org
🤖 AI Summary
Multi-objective online learning and risk-sensitive optimization in average-reward Markov decision processes (MDPs) remain underexplored. Method: This paper proposes the Reward-Extended Differential (RED) reinforcement learning framework, which exploits a structural property unique to average-reward MDPs to learn multiple objectives, or subtasks, simultaneously and fully online. The framework comprises a family of RED prediction and control algorithms, including proven-convergent variants for the tabular case. Results: The RED algorithms are used to learn a policy that optimizes the conditional value-at-risk (CVaR) risk measure fully online, for the first time, without an explicit bi-level optimization scheme or an augmented state space.
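For intuition, the sketch below shows the standard tabular differential TD(0) update that average-reward methods in this family build on. It is a minimal illustration of the "differential" machinery the summary refers to, not the paper's RED algorithm itself; the subtask-decoupling layer is only described at a high level here, so it is omitted, and all names and step sizes are illustrative.

```python
import numpy as np

def differential_td_step(V, r_bar, s, r, s_next, alpha=0.1, eta=0.1):
    """One fully online differential TD(0) prediction update.

    Standard in average-reward RL: state values are learned relative
    to a concurrently estimated average reward r_bar, with no
    discounting.
    """
    delta = r - r_bar + V[s_next] - V[s]  # differential TD error
    V[s] += alpha * delta                 # differential value update
    r_bar += eta * alpha * delta          # average-reward estimate update
    return V, r_bar

# Hypothetical usage on a 10-state problem:
V, r_bar = np.zeros(10), 0.0
V, r_bar = differential_td_step(V, r_bar, s=0, r=1.0, s_next=3)
```

Per the summary, RED's contribution is to let additional per-subtask estimates (e.g., a risk threshold) be driven by the same online sample stream, with convergence guarantees in the tabular case.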

📝 Abstract
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various learning objectives, or subtasks, simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.
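For background, CVaR admits the classical Rockafellar–Uryasev variational form shown below. One plausible reading of "without the use of an explicit bi-level optimization scheme" (an inference from the abstract, not a confirmed detail of the paper) is that the threshold variable b is learned online alongside the policy rather than optimized in an outer loop.

```latex
% Rockafellar--Uryasev representation of CVaR at tail level \tau \in (0,1],
% stated for a reward (to-be-maximized) random variable X:
\mathrm{CVaR}_{\tau}(X) \;=\; \max_{b \in \mathbb{R}}
  \Big\{\, b \;-\; \tfrac{1}{\tau}\,\mathbb{E}\big[(b - X)^{+}\big] \Big\},
\qquad (x)^{+} := \max(x, 0).
```

A naive treatment makes this bi-level: an outer search over b with an inner policy-optimization loop for each candidate. Estimating b concurrently with the policy collapses the two levels into a single fully online loop.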
Problem

Research questions and friction points this paper is trying to address.

Develop subtask-driven reinforcement learning for average-reward MDPs
Introduce the RED framework for solving multiple learning objectives simultaneously
Optimize the conditional value-at-risk (CVaR) measure in a fully online manner
Innovation

Methods, ideas, or system contributions that make the work stand out.

The RED reinforcement learning framework for prediction and control
Proven-convergent algorithms for the tabular case
Fully online CVaR optimization without a bi-level scheme or state augmentation (see the sketch below)
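A minimal sketch of what such a fully online loop could look like, assuming the Rockafellar–Uryasev transform above and a quantile-tracking update for the threshold; the function name, step sizes, and specific update rules are illustrative assumptions, not the paper's stated algorithm.

```python
import numpy as np

def cvar_control_step(Q, r_bar, b, s, a, r, s_next,
                      tau=0.25, alpha=0.1, eta=0.1, beta=0.01):
    """One fully online differential Q-learning step on a
    CVaR-transformed reward, with the threshold b learned as a
    concurrent estimate instead of via an outer optimization loop."""
    # Rockafellar-Uryasev transform: at the optimal threshold, the
    # long-run average of r_tilde equals the per-step CVaR_tau objective.
    r_tilde = b - max(b - r, 0.0) / tau
    # Standard differential Q-learning update on the transformed reward.
    delta = r_tilde - r_bar + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta
    r_bar += eta * alpha * delta
    # Quantile-style stochastic update: b drifts toward the
    # tau-quantile (VaR) of the observed per-step rewards.
    b += beta * (tau - float(r < b))
    return Q, r_bar, b
```

Because every quantity is updated from the same transition, no state augmentation or nested optimization is required, which matches the property the abstract highlights.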