Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of safety violations in off-policy safe reinforcement learning, which often arise from unconstrained exploration and biased cost estimation. To mitigate these issues, the authors propose the COX-Q algorithm, which reconciles conflicting reward and cost gradients through cost-constrained optimistic exploration. The method employs a truncated quantile critic to stabilize cost learning and quantify epistemic uncertainty, while a cost-constrained trust region adaptively modulates the policy update range. This approach establishes a distributional off-policy Q-learning framework that significantly improves sample efficiency, effectively controls data collection costs, and achieves high safety performance during testing across benchmark tasks including safe velocity control, navigation, and autonomous driving.
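The summary mentions reconciling conflicting reward and cost gradients in the action space. As a hedged illustration only (the paper's exact update rule is not given here), one common way to resolve such a conflict is to project the conflicting component of the reward gradient out of the cost-gradient direction, so that the update does not push the cumulative cost up; the function name and sign convention below are assumptions for illustration:

```python
import numpy as np

def resolve_gradient_conflict(g_reward, g_cost):
    """Illustrative PCGrad-style projection (not the paper's exact rule).

    g_reward: ascent direction for the return.
    g_cost:   gradient of the cumulative cost (ascending it raises cost).
    If following g_reward would also increase cost (positive dot product),
    remove the component of g_reward along g_cost.
    """
    dot = np.dot(g_reward, g_cost)
    if dot > 0:  # conflict: reward ascent also raises cost
        g_reward = g_reward - (dot / np.dot(g_cost, g_cost)) * g_cost
    return g_reward

# Example: reward gradient partly aligned with the cost gradient.
g_r = np.array([1.0, 1.0])
g_c = np.array([1.0, 0.0])
safe_dir = resolve_gradient_conflict(g_r, g_c)  # → [0.0, 1.0]
```

After projection the resulting direction is orthogonal to the cost gradient, so a small step along it leaves the (first-order) cost estimate unchanged while still improving return along the remaining component.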

📝 Abstract
When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.
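The abstract's second ingredient, truncated quantile critics, can be sketched concretely. In the standard truncated-quantile-critics idea, an ensemble of critics each outputs a set of quantile atoms; all atoms are pooled and sorted, the largest few per critic are dropped to curb overestimation bias, and the rest are averaged. The sketch below is a minimal NumPy version under those assumptions (the function name, truncation amount, and uncertainty proxy are illustrative, not the paper's exact design):

```python
import numpy as np

def truncated_quantile_estimate(quantiles, drop_per_net=2):
    """Sketch of a truncated quantile value estimate.

    quantiles: array of shape (n_nets, n_quantiles), the atoms
               predicted by each critic in the ensemble.
    drop_per_net: number of top atoms discarded per critic to
                  reduce overestimation bias.
    Returns (estimate, uncertainty), where uncertainty is the
    standard deviation of the pooled atoms, a crude proxy for
    the epistemic uncertainty used to guide exploration.
    """
    n_nets, _ = quantiles.shape
    pooled = np.sort(quantiles.reshape(-1))               # pool and sort all atoms
    keep = pooled[: len(pooled) - drop_per_net * n_nets]  # truncate the largest
    return keep.mean(), pooled.std()

# Two critics, five quantile atoms each; outlier atoms at 10.0.
q = np.array([[0.0, 1.0, 2.0, 3.0, 10.0],
              [0.0, 1.0, 2.0, 3.0, 10.0]])
est, unc = truncated_quantile_estimate(q)  # estimate 1.0 < untruncated mean 3.2
```

Dropping the top atoms makes the estimate deliberately conservative, which matters more for the cost critic than the reward critic: underestimating cumulative cost is what leads to constraint violations.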
Problem

Research questions and friction points this paper is trying to address.

safe reinforcement learning
off-policy
constraint violation
cumulative cost
exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Optimistic Exploration
Off-policy Safe RL
Truncated Quantile Critics
Cost-bounded Exploration
Distributional Value Learning
Guopeng Li
Faculty of Mechanical Engineering, Delft University of Technology, Delft, the Netherlands
Matthijs T. J. Spaan
Delft University of Technology
Julian F. P. Kooij
Faculty of Mechanical Engineering, Delft University of Technology, Delft, the Netherlands