🤖 AI Summary
This work studies reinforcement learning under resource constraints in unknown constrained Markov decision processes (CMDPs), where the transition probabilities, rewards, and resource-consumption functions are all unknown. The goal is instance-dependent, sample-efficient learning. Addressing the $O(1/\epsilon^2)$ sample-complexity bottleneck of existing methods, we establish the first instance-dependent logarithmic regret bound for CMDPs. Our approach introduces a novel characterization of problem hardness based on the optimal basis of the linear programming (LP) formulation of the CMDP, along with an optimal-basis identification-and-elimination mechanism and a resource-adaptive re-solving framework. By performing online optimization in the primal space and explicitly modeling the remaining resource budgets, we achieve a sample complexity of $O\big((1/(\Delta\epsilon))\log^2(1/\epsilon)\big)$, where $\Delta > 0$ is a problem-structure-dependent constant. This result significantly improves upon prior instance-independent bounds.
📝 Abstract
We consider the reinforcement learning problem for constrained Markov decision processes (CMDPs), which play a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collect a reward, and consume some resources, all of which are unknown and must be learned over time. In this work, we take a first step towards deriving optimal problem-dependent guarantees for CMDPs. We derive a logarithmic regret bound, which translates into an $O(\frac{1}{\Delta\cdot\epsilon}\cdot\log^2(1/\epsilon))$ sample complexity bound, with $\Delta$ being a problem-dependent parameter that is independent of $\epsilon$. Our sample complexity bound improves upon the state-of-the-art $O(1/\epsilon^2)$ sample complexity for CMDP problems established in the previous literature, in terms of the dependency on $\epsilon$. To achieve this advance, we develop a new framework for analyzing CMDP problems. Specifically, our algorithm operates in the primal space, and we re-solve the primal LP of the CMDP at each period in an online manner, with adaptively updated remaining resource capacities. The key elements of our algorithm are: i) a characterization of instance hardness via the LP basis; ii) an elimination procedure that identifies one optimal basis of the primal LP; and iii) a re-solving procedure that is adaptive to the remaining resources and sticks to the identified optimal basis.
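To make the primal-LP viewpoint concrete, below is a minimal sketch of the occupancy-measure LP that a re-solving scheme would solve each period with the current remaining resource budgets. This is an illustrative toy, not the paper's algorithm: it assumes a discounted CMDP with *known* transitions `P`, rewards `r`, and consumption functions `c` (in the paper these are learned from data), and the function name `solve_cmdp_lp` and the use of `scipy.optimize.linprog` are our own choices.

```python
import numpy as np
from scipy.optimize import linprog


def solve_cmdp_lp(P, r, c, budgets, mu0, gamma=0.9):
    """Solve the primal occupancy-measure LP of a discounted CMDP.

    P: (S, A, S) transition tensor, r: (S, A) rewards,
    c: (K, S, A) resource-consumption functions,
    budgets: (K,) remaining resource budgets, mu0: (S,) initial distribution.
    Returns a randomized policy and the optimal LP value.
    """
    S, A = r.shape
    K = c.shape[0]
    n = S * A  # one variable q(s, a) per state-action pair

    # linprog minimizes, so negate the reward objective.
    obj = -r.reshape(n)

    # Flow conservation for each state s':
    #   sum_a q(s', a) - gamma * sum_{s,a} P[s, a, s'] q(s, a) = (1 - gamma) mu0(s')
    A_eq = np.zeros((S, n))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                idx = s * A + a
                A_eq[sp, idx] = (1.0 if s == sp else 0.0) - gamma * P[s, a, sp]
    b_eq = (1 - gamma) * mu0

    # Resource constraints: expected consumption within the remaining budgets.
    A_ub = c.reshape(K, n)
    b_ub = budgets

    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    q = res.x.reshape(S, A)
    # The induced randomized policy is the row-normalized occupancy measure.
    policy = q / np.maximum(q.sum(axis=1, keepdims=True), 1e-12)
    return policy, -res.fun
```

In a re-solving loop, one would call this each period with `budgets` replaced by the resources actually remaining, which is the "adaptive remaining resource capacities" idea described above; the paper's method additionally restricts the LP to the identified optimal basis.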