🤖 AI Summary
This paper studies structured reinforcement learning through the lens of orchestration, where a small set of expert policies guides decision-making, particularly in settings where exploration is challenging. The authors provide the first formal model of orchestration, covering both an oracle-assisted setting and a realistic one in which value functions must be estimated. They establish value-function regret bounds for orchestration in the tabular setting by transferring regret-bound results from adversarial learning, and they generalize the natural policy gradient analysis of Agarwal et al. [2021] to arbitrary adversarial aggregation strategies and to estimated advantage functions, yielding sample-complexity guarantees both in expectation and with high probability. A key feature of the approach is its arguably more transparent proofs compared to existing methods. Simulations on a stochastic matching toy model confirm its efficacy.
📝 Abstract
Structured reinforcement learning leverages policies with advantageous properties to achieve better performance, particularly in scenarios where exploration poses challenges. We explore this field through the concept of orchestration, where a (small) set of expert policies guides decision-making; the modeling thereof constitutes our first contribution. We then establish value-function regret bounds for orchestration in the tabular setting by transferring regret-bound results from adversarial settings. We generalize and extend the analysis of natural policy gradient in Agarwal et al. [2021, Section 5.3] to arbitrary adversarial aggregation strategies. We also extend it to the case of estimated advantage functions, providing insights into sample complexity both in expectation and with high probability. A key point of our approach lies in its arguably more transparent proofs compared to existing methods. Finally, we present simulations for a stochastic matching toy model.
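To make the orchestration idea concrete, the following is a minimal, hypothetical sketch (not the paper's exact algorithm) of aggregating a small set of expert policies with an exponential-weights update driven by estimated advantages. All names (`orchestrate`, `adv_hat`, the learning rate `eta`) and the specific scoring rule are illustrative assumptions, standing in for the "arbitrary adversarial aggregation strategies" the analysis covers.

```python
import numpy as np

def orchestrate(experts, adv_hat, state, weights, eta=0.5):
    """One illustrative orchestration step (hypothetical sketch).

    experts : list of (n_states, n_actions) row-stochastic policy arrays.
    adv_hat : (n_states, n_actions) array of estimated advantages.
    state   : current state index.
    weights : current weight vector over the experts.
    Returns the mixture policy at `state` and the updated expert weights.
    """
    # Score each expert by the estimated advantage of its own
    # action distribution at the current state.
    gains = np.array([pi[state] @ adv_hat[state] for pi in experts])
    # Exponential-weights (Hedge-style) update, then renormalize.
    new_w = weights * np.exp(eta * gains)
    new_w /= new_w.sum()
    # The orchestrated policy is the weighted mixture of the experts.
    mixture = sum(w * pi[state] for w, pi in zip(new_w, experts))
    return mixture, new_w
```

Run on a one-state, two-action example where the second expert favors the higher-advantage action, the weights concentrate on that expert over repeated updates, illustrating how adversarial (no-regret) aggregation tracks the best expert.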