AI Summary
This work addresses online finite-horizon linear constrained Markov decision processes (CMDPs) in which the losses are chosen adversarially and revealed under full-information feedback, while the costs are stochastic and observed under bandit feedback. Existing work on linear CMDPs has primarily focused on stochastic settings. The paper proposes a primal-dual policy optimization algorithm that is the first to achieve sublinear regret and constraint violation bounds in this setting. Its key components are a new policy class, weighted LogSumExp softmax policies, together with a periodic policy mixing mechanism and a regularized dual update. The analysis shows that both regret and constraint violation are bounded by $\widetilde{\mathcal{O}}(K^{3/4})$ over $K$ episodes, and numerical experiments support the theoretical findings.
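To see why this rate counts as sublinear, divide the bound by the number of episodes. This is a one-line consequence of the stated rate, ignoring the logarithmic factors hidden by the tilde:

$$\frac{\widetilde{\mathcal{O}}(K^{3/4})}{K} = \widetilde{\mathcal{O}}(K^{-1/4}) \to 0 \quad \text{as } K \to \infty.$$

Both the per-episode regret and the per-episode constraint violation therefore vanish as $K$ grows.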
Abstract
Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components, periodic policy mixing and a regularized dual update, which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.
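The abstract does not spell out the regularized dual update, so the following is only a minimal sketch of the generic textbook form of such a step (projected dual ascent with a shrinkage term), not the paper's actual rule. All names and values here (`regularized_dual_update`, `run_dual_updates`, `step_size`, `reg`, `lam_max`, `budget`) are hypothetical choices made for illustration.

```python
def regularized_dual_update(lam, observed_cost, budget, step_size, reg, lam_max):
    """One projected, regularized dual ascent step (generic sketch, not the paper's rule)."""
    # Ascent direction: observed constraint violation minus a shrinkage term reg * lam.
    grad = (observed_cost - budget) - reg * lam
    # Take the step, then project back onto the interval [0, lam_max].
    return min(max(lam + step_size * grad, 0.0), lam_max)


def run_dual_updates(costs, budget, step_size=0.1, reg=0.05, lam_max=10.0):
    """Apply the update once per episode and return the dual variable trajectory."""
    lam = 0.0
    history = []
    for cost in costs:
        lam = regularized_dual_update(lam, cost, budget, step_size, reg, lam_max)
        history.append(lam)
    return history


if __name__ == "__main__":
    # Costs above the budget push the dual variable up; the shrinkage term slows its growth.
    print(run_dual_updates([1.0] * 5, budget=0.5))
```

In this sketch the shrinkage term and the projection together keep the dual variable bounded, which is the role the abstract attributes to the regularized dual update; how the paper achieves this is an assumption left open here.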