Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies sample-efficient algorithms for infinite-horizon discounted linearly constrained Markov decision processes (CMDPs), addressing both relaxed feasibility (allowing small constraint violations) and strict feasibility (zero violations). It proposes the first generic framework that integrates any unconstrained MDP solver into a primal-dual optimization scheme to handle linear constraints. The method employs mirror-descent value iteration (MDVI) as a subroutine, combined with generative-model sampling and linear function approximation. Theoretically, the paper establishes near-optimal sample complexity bounds depending on the feature dimension $d$, discount factor $\gamma$, accuracy $\varepsilon$, and Slater constant $\zeta$: $\tilde{O}\big(d^2 / ((1-\gamma)^4 \varepsilon^2)\big)$ for relaxed feasibility and $\tilde{O}\big(d^2 / ((1-\gamma)^6 \varepsilon^2 \zeta^2)\big)$ for strict feasibility. These bounds recover near-optimal rates for tabular CMDPs as a special case, demonstrating tight dependence on all key parameters.

📝 Abstract
We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (\texttt{MDVI})~\citep{kitamura2023regularization} as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\varepsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\varepsilon^2}\right)$ samples. We note that these results exhibit a near-optimal dependence on both $d$ and $\varepsilon$. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\varepsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.
Problem

Research questions and friction points this paper is trying to address.

Maximizing reward in constrained MDPs with linear features
Providing sample complexity bounds for relaxed feasibility
Deriving sample bounds for strict feasibility using Slater constant
Innovation

Methods, ideas, or system contributions that make the work stand out.

Primal-dual framework for constrained MDPs
Mirror descent value iteration solver
Sample complexity bounds analysis
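The primal-dual idea above — wrap any unconstrained MDP solver in a Lagrangian loop — can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: the paper uses MDVI with linear function approximation and generative-model sampling, whereas here plain value iteration stands in for the black-box solver, and the function names (`solve_mdp`, `primal_dual_cmdp`), dual step size, and clipping bound on the multiplier are illustrative choices.

```python
import numpy as np

GAMMA = 0.9  # discount factor (the paper's gamma)

def solve_mdp(P, reward, gamma=GAMMA, iters=500):
    """Stand-in for the black-box unconstrained MDP solver: value iteration.

    P: (S, A, S) transition tensor, reward: (S, A).
    Returns a greedy deterministic policy (S,) and Q-values (S, A)."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        v = q.max(axis=1)
        q = reward + gamma * P @ v  # (S, A, S) @ (S,) -> (S, A)
    return q.argmax(axis=1), q

def policy_value(P, signal, policy, gamma=GAMMA):
    """Expected discounted cumulative `signal` (reward or cost) of a
    deterministic policy, from a uniform initial state distribution."""
    S, _, _ = P.shape
    Ppi = P[np.arange(S), policy]        # (S, S) transitions under the policy
    spi = signal[np.arange(S), policy]   # (S,) per-state signal
    v = np.linalg.solve(np.eye(S) - gamma * Ppi, spi)
    return v.mean()

def primal_dual_cmdp(P, reward, cost, budget, lam_max=10.0, steps=200, lr=0.05):
    """Dual ascent on the multiplier lambda; the unconstrained solver is
    invoked as a black box on the Lagrangian reward r - lambda * c."""
    lam = 0.0
    policy = None
    for _ in range(steps):
        policy, _ = solve_mdp(P, reward - lam * cost)
        violation = policy_value(P, cost, policy) - budget
        lam = float(np.clip(lam + lr * violation, 0.0, lam_max))
    return policy, lam
```

Note that the theoretical guarantees in this line of work typically apply to a mixture (average) of the primal iterates rather than the last deterministic policy returned here; that averaging step is omitted for brevity.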
Xingtu Liu
Simon Fraser University
Lin F. Yang
University of California, Los Angeles
Sharan Vaswani
Simon Fraser University
Machine Learning · Optimization · Artificial Intelligence