Provably Efficient Sample Complexity for Robust CMDP

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies robust constrained Markov decision processes (RCMDPs) under environmental uncertainty: learning a policy that satisfies safety constraints—i.e., cumulative utility ≥ threshold—in the worst case over a known uncertainty set, while maximizing expected reward. We first prove that Markov policies are not necessarily optimal in RCMDPs, motivating a novel augmented state-space formulation based on residual utility budgets. Building on this, we propose Robust Constrained Value Iteration (RCVI), a model-based algorithm integrating generative-model sampling, uncertainty-set modeling, and constrained optimization. RCVI is the first to establish a sample complexity upper bound of $\tilde{O}(|S||A|H^5/\varepsilon^2)$ for RCMDPs, guaranteeing—with high probability—that the output policy satisfies the safety constraints up to violation $\varepsilon$. This work advances both the theoretical foundations and practical applicability of safe reinforcement learning.
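The residual-budget augmentation can be sketched as follows (the notation here is illustrative, not necessarily the paper's): the state is extended with the utility budget still to be collected, which shrinks as utility is accumulated:

$$\tilde{s}_h = (s_h, b_h), \qquad b_{h+1} = b_h - u_h(s_h, a_h), \qquad b_1 = \xi,$$

where $\xi$ is the constraint threshold. A trajectory satisfies the constraint exactly when the residual budget is non-positive at the end of the episode, so a policy over $\tilde{s}_h$ can condition on how much utility remains to be collected, which is precisely the history information that plain Markov policies lack.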

📝 Abstract
We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints, even when the real environment differs from a simulator or nominal model. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring cumulative utility exceeds a threshold under the worst-case dynamics within an uncertainty set. While recent works have established finite-time iteration complexity guarantees for RCMDPs using policy optimization, their sample complexity guarantees remain largely unexplored. In this paper, we first show that Markovian policies may fail to be optimal even under rectangular uncertainty sets, unlike in the unconstrained robust MDP. To address this, we introduce an augmented state space that incorporates the remaining utility budget into the state representation. Building on this formulation, we propose a novel Robust Constrained Value Iteration (RCVI) algorithm with a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^5/\epsilon^2)$, achieving at most $\epsilon$ constraint violation using a generative model, where $|S|$ and $|A|$ denote the sizes of the state and action spaces, respectively, and $H$ is the episode length. To the best of our knowledge, this is the first sample complexity guarantee for RCMDPs. Empirical results further validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Learning safe reward-maximizing policies under model uncertainty
Addressing non-optimality of Markovian policies in robust CMDPs
Establishing first sample complexity guarantee for robust constrained MDPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmented state space with utility budget
Robust constrained Value iteration algorithm
Sample complexity guarantee for RCMDP
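As a concrete toy illustration of the bullets above, the following sketch runs value iteration on a budget-augmented state space against a small finite, (s, a)-rectangular uncertainty set. The tiny MDP, the discretized budgets, the pathwise feasibility rule, and all names are assumptions made here for illustration only; the paper's RCVI handles the expected-utility constraint and estimates the model from generative-model samples rather than assuming it is known.

```python
import itertools

# Toy sketch of robust value iteration on an augmented state (s, b),
# where b is the residual utility budget still to be collected.
S = [0, 1]              # states
A = [0, 1]              # actions
H = 3                   # horizon
BUDGETS = [0, 1, 2, 3]  # discretized residual utility budgets

# Rewards and (integer) utilities per (state, action); illustrative values.
reward = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 1.5}
utility = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# (s, a)-rectangular uncertainty set: a few candidate next-state
# distributions per (state, action); the adversary picks the worst one.
U = {
    (0, 0): [{0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}],
    (0, 1): [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
    (1, 0): [{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}],
    (1, 1): [{0: 0.1, 1: 0.9}, {0: 0.3, 1: 0.7}],
}

NEG_INF = float("-inf")

def rcvi():
    """Backward induction over augmented states (h, s, b).

    V[h][(s, b)] is the best worst-case reward-to-go from state s at
    step h, given b more units of utility must still be collected;
    NEG_INF marks infeasible augmented states (pathwise feasibility,
    a stricter toy stand-in for the expected-utility constraint)."""
    V = [dict() for _ in range(H + 1)]
    # Terminal step: feasible iff the whole budget has been collected.
    for s, b in itertools.product(S, BUDGETS):
        V[H][(s, b)] = 0.0 if b == 0 else NEG_INF
    for h in range(H - 1, -1, -1):
        for s, b in itertools.product(S, BUDGETS):
            best = NEG_INF
            for a in A:
                nb = max(b - utility[(s, a)], 0)  # residual budget update
                # Adversary: worst-case kernel in the uncertainty set.
                worst = min(
                    sum(p * V[h + 1][(s2, nb)] for s2, p in kernel.items())
                    for kernel in U[(s, a)]
                )
                best = max(best, reward[(s, a)] + worst)
            V[h][(s, b)] = best
    return V
```

The agent's maximization and the adversary's minimization alternate at every step, mirroring a robust Bellman backup; the safety constraint enters only through which augmented states remain feasible. Tightening the initial budget (e.g., `V[0][(0, 2)]` versus `V[0][(0, 0)]`) can only lower the worst-case reward-to-go.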