SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses safe exploration for reinforcement learning agents in safety-critical settings. The authors propose a pessimistic policy optimization method grounded in epistemic uncertainty, using the policy's sensitivity to parameter perturbations as a practical proxy for that uncertainty. Their sharpness-aware policy update evaluates gradients at perturbed parameters, which implicitly reweights the policy gradient: rare unsafe actions gain influence while already-safe behaviors contribute less, biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, the approach consistently improves both safety and task performance over existing baselines, expanding the Pareto frontier between the two.
📝 Abstract
Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
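The idea of "evaluating gradients at perturbed parameters" can be illustrated with a generic sharpness-aware (SAM-style) two-step update. The sketch below is not the SHAPO algorithm itself: the linear-Gaussian policy, the surrogate loss, the function names, and the values of `rho` and `lr` are all assumptions made for illustration, and the abstract does not specify how SHAPO chooses its perturbation or handles safety costs.

```python
import numpy as np


def surrogate_loss(theta, states, actions, advantages):
    """Policy-gradient surrogate for a hypothetical linear-Gaussian policy.

    The policy mean is states @ theta with unit variance, so the log-probability
    is -0.5 * (a - mean)^2 up to an additive constant.
    """
    mean = states @ theta
    log_prob = -0.5 * (actions - mean) ** 2
    return -np.mean(advantages * log_prob)


def surrogate_grad(theta, states, actions, advantages):
    """Analytic gradient of surrogate_loss with respect to theta."""
    mean = states @ theta
    # d(log_prob)/d(theta) = (a - mean) * s for each sample.
    per_sample = (advantages * (actions - mean))[:, None] * states
    return -np.mean(per_sample, axis=0)


def sharpness_aware_step(theta, batch, rho=0.05, lr=0.1):
    """One SAM-style update: perturb, re-evaluate the gradient, then descend.

    rho (perturbation radius) and lr (step size) are illustrative values.
    """
    g = surrogate_grad(theta, *batch)
    # Move to the approximate worst-case point within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # The gradient at the perturbed parameters drives the actual update.
    g_perturbed = surrogate_grad(theta + eps, *batch)
    return theta - lr * g_perturbed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = (
        rng.normal(size=(64, 3)),  # states
        rng.normal(size=64),       # actions
        rng.normal(size=64),       # advantages
    )
    theta = np.zeros(3)
    for _ in range(5):
        theta = sharpness_aware_step(theta, batch)
    print("theta after 5 sharpness-aware steps:", theta)
```

Setting `rho=0` recovers an ordinary gradient step, which makes the perturbation radius the single knob that controls how pessimistic the update is in this toy setting.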
Problem

Research questions and friction points this paper is trying to address.

safe exploration
reinforcement learning
epistemic uncertainty
policy optimization
safety-critical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sharpness-Aware Optimization
Safe Reinforcement Learning
Epistemic Uncertainty
Policy Gradient Reweighting
Conservative Exploration