Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ensuring safety verification of reinforcement learning (RL) policies in high-risk scenarios remains challenging due to the complexity and opacity of learned behaviors. Method: This paper proposes a synergistic closed-loop framework integrating explainable abstraction with empirical falsification. It introduces Comprehensible Abstract Policy Summarization (CAPS) for semantic policy compression; combines probabilistic model checking (via Storm) with risk-weighted counterexample search; provides PAC-style guarantees on the discovery of undetected violations; and incorporates a lightweight runtime safety shield (fallback mechanism) for failure mitigation without retraining. Results: Evaluated on multiple continuous-control benchmarks, the approach improves the violation detection rate by 37%, uncovers an additional 12.8% of hidden defects even when abstraction-based verification passes, achieves counterexample interpretability of 4.6/5.0 (expert-rated), and ensures sub-5 ms response latency.
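The semantic compression step above builds a small abstract transition graph from offline trajectories. A minimal sketch of the idea (not the paper's CAPS implementation; `abstract_fn` and the sign-based abstraction are hypothetical placeholders for the paper's learned state abstraction):

```python
from collections import defaultdict

def build_abstract_graph(trajectories, abstract_fn):
    """Count transitions between abstract states observed in offline
    trajectories, then normalize counts into transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[abstract_fn(s)][abstract_fn(s_next)] += 1
    graph = {}
    for a, succs in counts.items():
        total = sum(succs.values())
        graph[a] = {b: n / total for b, n in succs.items()}
    return graph

# Toy example: abstract a 1-D state by its sign.
trajs = [[-2.0, -1.0, 0.5, 1.5], [1.0, 2.0, -0.5]]
sign = lambda s: "neg" if s < 0 else "pos"
g = build_abstract_graph(trajs, sign)
```

The resulting probability-labeled graph is exactly the kind of object a probabilistic model checker such as Storm can take as a Markov-chain input, which is why abstraction quality directly bounds what verification can conclude.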

📝 Abstract
Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires not only formal verification but also interpretability and targeted falsification. While model checking provides formal guarantees, its effectiveness is limited by abstraction quality and the completeness of the underlying trajectory dataset. We propose a hybrid framework that integrates (1) explainability, (2) model checking, and (3) risk-guided falsification to achieve both rigor and coverage. Our approach begins by constructing a human-interpretable abstraction of the RL policy using Comprehensible Abstract Policy Summarization (CAPS). This abstract graph, derived from offline trajectories, is both verifier-friendly and semantically meaningful, and serves as input to the Storm probabilistic model checker to verify satisfaction of temporal safety specifications. If the model checker identifies a violation, it returns an interpretable counterexample trace showing how the policy fails the safety requirement. However, if no violation is detected, we cannot conclude satisfaction, owing to potential limitations in the abstraction and in the coverage of the offline dataset. In such cases, we estimate the associated risk during model checking and use it to guide a falsification strategy that prioritizes high-risk states and regions underrepresented in the trajectory dataset. We further provide PAC-style guarantees on the likelihood of uncovering undetected violations. Finally, we incorporate a lightweight safety shield that switches to a fallback policy at runtime when the estimated risk exceeds a threshold, enabling failure mitigation without retraining.
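The runtime shield described at the end of the abstract reduces to a simple per-step check: query the learned policy, and if the estimated risk of its proposed action crosses a threshold, substitute the fallback. A minimal sketch under that reading (all functions here, including the toy `risk` proxy, are illustrative assumptions, not the paper's shield):

```python
def shielded_action(state, policy, fallback, risk_fn, threshold=0.1):
    """Return the learned policy's action unless its estimated risk
    exceeds the threshold, in which case switch to the fallback."""
    action = policy(state)
    if risk_fn(state, action) > threshold:
        return fallback(state), True   # shield engaged
    return action, False

# Toy 1-D example: the learned policy always pushes forward; the
# fallback brakes. Risk is a proxy for overshooting the origin.
policy = lambda s: 1.0
fallback = lambda s: 0.0
risk = lambda s, a: max(0.0, s + a)

far_from_boundary = shielded_action(-3.0, policy, fallback, risk, threshold=0.5)
near_boundary = shielded_action(0.2, policy, fallback, risk, threshold=0.5)
```

Because the shield only wraps the deployed policy's action selection, it mitigates detected failure modes without any retraining, consistent with the sub-millisecond-scale latency the summary reports.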
Problem

Research questions and friction points this paper is trying to address.

Ensuring RL policy safety via verification and interpretability
Improving abstraction quality for effective model checking
Guiding falsification with risk-aware exploration strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable abstraction via CAPS for interpretability
Risk-guided falsification targeting high-risk states
Lightweight safety shield for runtime failure mitigation
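The second innovation bullet, risk-guided falsification, amounts to ranking candidate start states by estimated risk while up-weighting states the offline dataset rarely visits. A minimal sketch of such a priority scheme (a plausible scoring rule assumed for illustration, not the paper's exact strategy; `alpha` is a hypothetical exploration weight):

```python
def prioritize_states(states, risk, visits, alpha=1.0):
    """Rank candidate states for falsification: prefer states with high
    estimated risk and few appearances in the offline trajectory data."""
    def score(s):
        # Inverse visit count rewards underrepresented regions.
        return risk[s] / (1 + visits.get(s, 0)) ** alpha
    return sorted(states, key=score, reverse=True)

# Toy example: "c" is high-risk and nearly unseen, so it is searched first.
states = ["a", "b", "c"]
risk = {"a": 0.2, "b": 0.9, "c": 0.9}
visits = {"a": 10, "b": 50, "c": 1}
order = prioritize_states(states, risk, visits)
```

Note how "b", despite being as risky as "c", ranks last: its region is already well covered by the dataset, so falsification budget is better spent elsewhere.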