Provably Optimal Reinforcement Learning under Safety Filtering

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) in safety-critical applications faces a fundamental challenge: ensuring safety without compromising asymptotic performance, since safety filters are commonly assumed to degrade the learned policy. Method: The safety filter is modeled as an integral part of the environment, formalized via safety-critical Markov decision processes (SC-MDPs) and their filtered MDP counterparts, which allows standard RL algorithms to be applied unchanged. Contribution/Results: The paper gives the first theoretical proof that enforcement by a sufficiently permissive safety filter preserves asymptotic optimality, matching the best safe policy in the SC-MDP and showing that the perceived safety-performance trade-off is not inherent. Experiments on Safety Gymnasium show zero safety violations throughout training, with final task performance matching or exceeding unsafe baselines. The framework thus establishes a separation between safety enforcement and performance optimization, delivering both provable safety guarantees and asymptotically optimal performance.

📝 Abstract
Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP, when executed through the same filter, achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
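The abstract's key construction is the filtered MDP: the safety filter sits inside the environment and overrides any action that would reach a failure state, so every action the agent submits has a safe effect. A minimal sketch of this idea on a toy 1-D environment follows; all names here (`GridEnv`, `FilteredEnv`, the fallback action) are illustrative assumptions, not the paper's implementation.

```python
class GridEnv:
    """Toy base environment: states 0..4, where state 0 is a catastrophic failure."""
    def __init__(self):
        self.state = 2

    def unsafe(self, action):
        # The action would move the agent into the failure state 0.
        return self.state + action <= 0

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward


class FilteredEnv:
    """Filtered MDP: the filter is part of the environment and overrides
    unsafe actions with a safe fallback before stepping the base env."""
    def __init__(self, env, fallback=+1):
        self.env = env
        self.fallback = fallback  # assumed safe fallback action

    def step(self, action):
        if self.env.unsafe(action):
            action = self.fallback  # filter intervenes
        return self.env.step(action)


env = FilteredEnv(GridEnv())
# A deliberately unsafe policy that always pushes toward the failure state:
trajectory = [env.step(-1)[0] for _ in range(5)]
# The failure state 0 is never visited, because the filter bounces the agent back.
```

From the learning algorithm's point of view, `FilteredEnv` is just another MDP, which is why, per the abstract, standard RL convergence results carry over unchanged.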
Problem

Research questions and friction points this paper is trying to address.

Ensuring categorical safety in reinforcement learning without performance degradation
Formalizing safety-critical MDPs with provably safe action filtering
Achieving optimal RL performance through permissive safety filters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using a safety filter for categorical avoidance of catastrophic failure states
Defining a filtered MDP in which all actions have safe effects
Proving that policies optimal in the filtered MDP match the best safe policy in the SC-MDP