🤖 AI Summary
This work addresses the problem that reinforcement learning policies can behave unsafely under transition perturbations, which makes tight probabilistic safety guarantees hard to obtain. The authors propose a dual optimization framework built on the latent space of a variational autoencoder (VAE). The VAE approximates the distribution of encountered states, and upper- and lower-bound probabilistic barrier certificates are constructed from latent characteristics of states to identify regions of known safe behavior with high confidence; the lower-bound certificate gives the more conservative estimate of the safe region. During training, the method samples states in the non-robust region, i.e. the set difference between the two certificates, to tighten the upper and lower bounds on the probability of safety violation. The authors describe the resulting guarantees and demonstrate the tightness of the bounds experimentally.
📝 Abstract
Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may be susceptible to transition perturbations that result in unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violating these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper- and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this as a dual optimization problem in which the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. We describe the resulting guarantees and demonstrate the tightness of our bounds experimentally.
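
To make the idea of the non-robust region concrete, the following is a minimal Python sketch rather than the authors' method. It assumes a sign convention in which a certificate value of at most zero means "certified safe", uses two hypothetical toy certificates over a 2-D latent space (`lower_cert`, `upper_cert`), a hypothetical violation check (`toy_violates`), and a standard Hoeffding interval as a stand-in for whatever bound the paper actually derives. Latent points are drawn from a standard normal, which is only an assumption about the VAE prior.

```python
import math
import random


def hoeffding_interval(violations, n, delta=0.05):
    """Two-sided Hoeffding interval on a violation probability.

    Holds with probability at least 1 - delta for n i.i.d. samples.
    This is a generic concentration bound, not the paper's certificate.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = violations / n
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)


def in_non_robust_region(z, lower_cert, upper_cert):
    """True if z is safe under the upper-bound certificate but not under
    the more conservative lower-bound certificate (the set difference)."""
    return upper_cert(z) <= 0.0 and lower_cert(z) > 0.0


# Hypothetical toy certificates: discs of radius 1.0 and 1.5 in latent space.
def lower_cert(z):
    return math.hypot(z[0], z[1]) - 1.0


def upper_cert(z):
    return math.hypot(z[0], z[1]) - 1.5


# Hypothetical stand-in for rolling out the policy and checking constraints.
def toy_violates(z):
    return z[0] > 1.3


rng = random.Random(0)
latents = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]

# Keep only the latent points that fall between the two certificates.
non_robust = [z for z in latents if in_non_robust_region(z, lower_cert, upper_cert)]

if non_robust:
    k = sum(1 for z in non_robust if toy_violates(z))
    lo, hi = hoeffding_interval(k, len(non_robust))
    print(f"{len(non_robust)} non-robust samples, violation prob in [{lo:.3f}, {hi:.3f}]")
```

Concentrating samples in the region where the two certificates disagree is what lets the interval shrink where it matters: the Hoeffding half-width scales as 1/sqrt(n), so more samples drawn from the set difference give a tighter interval for that region.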