Near-Optimal Primal-Dual Algorithm for Learning Linear Mixture CMDPs with Adversarial Rewards

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards and unknown transition kernels, proposing the first provably efficient primal-dual policy optimization algorithm for this setting. The method combines regularized dual updates, weighted ridge regression for parameter estimation, and confidence interval construction, and introduces a drift-based analysis to handle cases where strong duality fails. Under full-information feedback, the algorithm simultaneously controls regret and constraint violation, achieving a bound of Õ(√(d²H³K)) in the feature dimension d, horizon H, and number of episodes K, which matches the known minimax lower bound up to logarithmic factors. This result substantially advances the theoretical foundations of adversarial safe reinforcement learning.
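The regularized dual update mentioned above can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the update rule, step size `eta`, regularization weight `gamma`, and cap `lam_max` are all illustrative assumptions. The idea it conveys is that a shrinkage term on the Lagrange multiplier keeps its episode-to-episode drift bounded, which is what a drift-based analysis needs when rewards change adversarially and strong duality cannot be invoked.

```python
import numpy as np

def regularized_dual_update(lam, violation_est, eta=0.1, gamma=0.01, lam_max=10.0):
    """One regularized dual-ascent step on a Lagrange multiplier.

    lam           : current dual variable (>= 0)
    violation_est : estimated constraint violation this episode
                    (positive = constraint violated)
    The -gamma * lam shrinkage term regularizes the update so the
    multiplier cannot drift without bound across episodes.
    All names and constants here are illustrative, not from the paper.
    """
    lam_new = lam + eta * (violation_est - gamma * lam)
    # Project back onto the feasible dual range [0, lam_max].
    return float(np.clip(lam_new, 0.0, lam_max))
```

A violated constraint (positive `violation_est`) pushes the multiplier up, penalizing unsafe policies in the next primal step; the shrinkage and the projection together keep the dual iterate in a bounded range.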
📝 Abstract
We study safe reinforcement learning in finite-horizon linear mixture constrained Markov decision processes (CMDPs) with adversarial rewards under full-information feedback and an unknown transition kernel. We propose a primal-dual policy optimization algorithm that achieves regret and constraint violation bounds of $\widetilde{O}(\sqrt{d^2 H^3 K})$ under mild conditions, where $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. To the best of our knowledge, this is the first provably efficient algorithm for linear mixture CMDPs with adversarial rewards. In particular, our regret bound is near-optimal, matching the known minimax lower bound up to logarithmic factors. The key idea is to introduce a regularized dual update that enables a drift-based analysis. This step is essential, as strong duality-based analysis cannot be directly applied when reward functions change across episodes. In addition, we extend weighted ridge regression-based parameter estimation to the constrained setting, allowing us to construct tighter confidence intervals that are crucial for deriving the near-optimal regret bound.
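The abstract's weighted ridge regression step for estimating the transition parameter can be sketched as follows. This is a generic weighted ridge solver under assumed shapes, not the paper's estimator: the per-sample weights `w`, the feature map, and the regularizer `lam` are placeholders, and the returned Gram matrix `Sigma` is the quantity one would use to build an ellipsoidal confidence set of the form {θ : ||θ − θ̂||_Σ ≤ β}.

```python
import numpy as np

def weighted_ridge_estimate(features, targets, weights, lam=1.0):
    """Weighted ridge regression for a linear parameter theta.

    Solves  argmin_theta  sum_t w_t (phi_t^T theta - y_t)^2 + lam ||theta||^2.
    Returns (theta_hat, Sigma), where Sigma = lam*I + sum_t w_t phi_t phi_t^T
    is the regularized Gram matrix defining the confidence ellipsoid.
    Shapes and weighting scheme are illustrative assumptions.
    """
    Phi = np.asarray(features, dtype=float)   # shape (T, d)
    y = np.asarray(targets, dtype=float)      # shape (T,)
    w = np.asarray(weights, dtype=float)      # shape (T,)
    d = Phi.shape[1]
    Sigma = lam * np.eye(d) + (Phi * w[:, None]).T @ Phi
    b = (Phi * w[:, None]).T @ y
    theta_hat = np.linalg.solve(Sigma, b)
    return theta_hat, Sigma
```

Down-weighting high-variance samples shrinks `Sigma`'s ill-conditioned directions less aggressively, which is the mechanism behind the tighter confidence intervals the abstract refers to.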
Problem

Research questions and friction points this paper is trying to address.

Constrained Markov Decision Processes
Adversarial Rewards
Safe Reinforcement Learning
Linear Mixture Models
Unknown Transition Kernel
Innovation

Methods, ideas, or system contributions that make the work stand out.

primal-dual algorithm
linear mixture CMDPs
adversarial rewards
near-optimal regret
regularized dual update
Kihyun Yu
Department of Industrial and Systems Engineering, KAIST
Seoungbin Bae
Department of Industrial and Systems Engineering, KAIST
Dabeen Lee
Department of Mathematical Sciences, Seoul National University
Optimization · Mathematical Programming · Algorithms · Machine Learning