🤖 AI Summary
This paper addresses safe reinforcement learning in unknown, high-risk environments with irreversible catastrophic costs and no resets. We propose a safety policy based on proactive human-in-the-loop intervention: when the agent faces a state risk it cannot control, it may request external assistance, avoiding catastrophe while preserving long-term performance. For the first time, we establish a no-regret guarantee within the general MDP framework, rigorously proving that safe intervention and efficient learning are inherently compatible. Our theoretical contributions include: (1) the first sublinear regret bound, O(√T), for MDPs with irreversible costs; (2) a formal zero-catastrophe guarantee; and (3) no reliance on environment resets or performance trade-offs. These results unify safety and efficiency, yielding a provably safe RL paradigm for high-stakes autonomous decision-making.
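For readers unfamiliar with the terminology, the no-regret claim can be stated as follows; the notation (V*, r_t) is standard and assumed here for illustration rather than quoted from the paper:

```latex
% Illustrative statement of the no-regret claim (notation assumed, not
% taken from the paper): regret is the agent's cumulative-reward gap
% to the optimal policy's value over T steps.
\[
  \mathrm{Regret}(T) \;=\; V^{*}(T) \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big]
  \;=\; O\big(\sqrt{T}\big),
\]
% so the average per-step shortfall vanishes:
\[
  \lim_{T \to \infty} \frac{\mathrm{Regret}(T)}{T} \;=\; 0.
\]
```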
📄 Abstract
Most reinforcement learning algorithms with regret guarantees rely on a critical assumption: that all errors are recoverable. Recent work by Plaut et al. discarded this assumption and presented algorithms that avoid "catastrophe" (i.e., irreparable errors) by asking for help. However, they provided only safety guarantees and did not consider reward maximization. We prove that any algorithm that avoids catastrophe in their setting also guarantees high reward (i.e., sublinear regret) in any Markov Decision Process (MDP), including MDPs with irreversible costs. This constitutes the first no-regret guarantee for general MDPs. More broadly, our result may be the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient in an unknown, unbounded, and high-stakes environment without causing catastrophe or requiring resets.
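To make the "asking for help" idea concrete, here is a minimal self-contained sketch of an ask-for-help loop in a toy no-reset environment with an irreversible trap state. The environment, the risk test, and the mentor below are all invented for illustration; this is not the paper's construction or algorithm.

```python
import random

# Toy line-walk environment: states 0..n. State 0 is an absorbing trap
# (the "catastrophe"); reaching state n yields reward. There are no
# resets: the whole run is one continuing trajectory.
class CliffWalk:
    def __init__(self, n=10):
        self.n, self.state = n, n // 2

    def step(self, action):          # action in {-1, +1}
        if self.state == 0:          # catastrophe is irreversible: no recovery
            return self.state, 0.0
        self.state = max(0, min(self.n, self.state + action))
        return self.state, 1.0 if self.state == self.n else 0.0

def mentor_act(state):
    return +1                        # the mentor always steps away from the trap

def risky(state, action):
    # Hypothetical risk estimate: flag any move that could land in the
    # trap as potentially catastrophic.
    return state + action <= 0

def run(horizon=1000, seed=0):
    rng = random.Random(seed)
    env, total, queries = CliffWalk(), 0.0, 0
    state = env.state
    for _ in range(horizon):
        action = rng.choice([-1, +1])    # naive exploring policy
        if risky(state, action):         # proactive intervention:
            action = mentor_act(state)   #   ask for help instead of acting
            queries += 1
        state, reward = env.step(action)
        total += reward
    return total, queries

if __name__ == "__main__":
    print(run())  # (cumulative reward, number of mentor queries)
```

In this sketch the agent never enters the trap, so it keeps accumulating reward indefinitely, which is the intuition behind the paper's claim that avoiding catastrophe is compatible with sublinear regret.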