Policy Gradient for Continuous-Time Robust Markov Decision Processes

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses worst-case performance guarantees in continuous-time robust reinforcement learning by extending robust Markov decision processes (RMDPs) to a continuous-time framework. It develops a policy gradient theory for this setting, deriving explicit expressions for policy and adversarial gradients via pathwise derivatives and adjoint methods for stochastic and ordinary differential equations. The authors propose a double-loop optimizer and a mean-field optimizer to solve the resulting problems, and introduce novel analytical tools for undiscounted total-cost MDPs. In the oracle-based setting, the double-loop optimizer achieves linear convergence; in the sample-based setting, it attains a sample complexity of Õ(1/ε²). The mean-field optimizer has an oracle-based convergence rate of Õ(1/K) and a sample complexity of Õ(N²/ε) under an N-particle approximation. Both optimizers are validated empirically on continuous-time RMDPs with neural ordinary differential equation dynamics.
📝 Abstract
The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}{\epsilon})$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.
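To make the double-loop idea concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm. It assumes a scalar linear ODE dx/dt = (a + xi - theta) x in place of neural ODE dynamics, a quadratic running cost, an interval uncertainty set [-c, c] for the adversary parameter xi, Euler discretisation, and forward pathwise sensitivities rather than an adjoint solve. The names `rollout` and `robust_policy_gradient` and all constants are hypothetical choices for illustration.

```python
def rollout(theta, xi, a=0.5, r=0.1, x0=1.0, T=2.0, n=200):
    """Euler rollout of dx/dt = (a + xi - theta) * x with control u = -theta * x.

    Returns the finite-horizon cost sum_k h * (x_k^2 + r * u_k^2) together with
    its exact (discrete) pathwise gradients w.r.t. theta and xi.
    """
    h = T / n
    m = a + xi - theta               # closed-loop drift coefficient
    w = 1.0 + r * theta ** 2         # running cost is w * x^2
    x, s, z = x0, 0.0, 0.0           # state, dx/dtheta, dx/dxi
    cost = g_theta = g_xi = 0.0
    for _ in range(n):
        cost += h * w * x * x
        g_theta += h * (2.0 * r * theta * x * x + 2.0 * w * x * s)
        g_xi += h * 2.0 * w * x * z
        # Sensitivity equations are the derivatives of the Euler step itself;
        # the tuple assignment uses the old x, s, z on the right-hand side.
        x, s, z = x + h * m * x, s + h * (m * s - x), z + h * (m * z + x)
    return cost, g_theta, g_xi


def robust_policy_gradient(theta=1.0, c=0.3, outer_steps=200, inner_steps=20,
                           lr_theta=0.5, lr_xi=0.5):
    """Double loop: projected ascent on xi (inner), descent on theta (outer)."""
    xi = 0.0
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            _, _, g_xi = rollout(theta, xi)
            xi = max(-c, min(c, xi + lr_xi * g_xi))   # project onto [-c, c]
        _, g_theta, _ = rollout(theta, xi)
        theta -= lr_theta * g_theta
    return theta, xi


if __name__ == "__main__":
    theta, xi = robust_policy_gradient()
    print("policy gain:", theta, "adversary parameter:", xi)
    print("worst-case cost:", rollout(theta, xi)[0])
```

In this toy problem the cost is increasing in xi, so the inner loop drives the adversary to the upper boundary c and the outer loop then lowers the cost under that worst-case drift. For high-dimensional policies, an adjoint (backward) solve would typically replace the forward sensitivities used here.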
Problem

Research questions and friction points this paper is trying to address.

robust Markov decision processes
continuous-time
policy gradient
reinforcement learning
worst-case dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous-time RMDP
policy gradient
adjoint method
double-loop optimizer
mean-field optimizer