🤖 AI Summary
This work addresses a limitation of existing off-policy safe reinforcement learning methods: they typically learn reward and safety Q-values with separate critic ensembles and handle uncertainty independently for each objective. Ignoring inter-objective correlation can make value estimates overly conservative and reduce sample efficiency. The authors propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates the covariance between objectives into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority sequentially, which preserves conservatism on safety while adaptively reducing excess conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization, and the method adds little computational overhead to existing deep Q-learning frameworks. In experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, COP-Q achieves strong safety performance with competitive or improved sample efficiency relative to representative baselines.
📝 Abstract
Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.
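The abstract does not state the bound itself, so the following is only a minimal sketch of how a safety-first Cholesky ordering could reduce the reward penalty, not the authors' formula. It assumes two objectives, an ensemble in which each member outputs a paired (safety, reward) estimate, a safety value where higher means safer, and a hypothetical scale parameter `beta`. The function name `cholesky_ordered_bounds` is illustrative. With safety ordered first, the Cholesky factor of the 2x2 covariance is L = [[l11, 0], [l21, l22]], and the reward's residual spread l22 is never larger than its marginal standard deviation.

```python
import math
from statistics import mean


def cholesky_ordered_bounds(safety_q, reward_q, beta=1.0):
    """Return (safety_lcb, reward_lcb_independent, reward_lcb_ordered).

    safety_q, reward_q: per-member estimates for one state-action pair,
    paired by ensemble member (an assumption of this sketch).
    Higher safety_q is assumed to mean "safer".
    """
    n = len(safety_q)
    mu_s, mu_r = mean(safety_q), mean(reward_q)

    # Sample covariance entries of the 2-vector (safety, reward).
    var_s = sum((s - mu_s) ** 2 for s in safety_q) / (n - 1)
    var_r = sum((r - mu_r) ** 2 for r in reward_q) / (n - 1)
    cov = sum((s - mu_s) * (r - mu_r)
              for s, r in zip(safety_q, reward_q)) / (n - 1)

    # Cholesky factor L = [[l11, 0], [l21, l22]], safety ordered first.
    l11 = math.sqrt(var_s)
    l21 = cov / l11 if l11 > 0 else 0.0
    l22 = math.sqrt(max(var_r - l21 ** 2, 0.0))

    # Safety keeps its full marginal penalty.
    safety_lcb = mu_s - beta * l11
    # Objective-wise baseline: reward penalized by its marginal std.
    reward_lcb_independent = mu_r - beta * math.sqrt(var_r)
    # Ordered variant (assumed form): reward penalized only by the spread
    # that is not already explained by the safety objective.
    reward_lcb_ordered = mu_r - beta * l22
    return safety_lcb, reward_lcb_independent, reward_lcb_ordered


if __name__ == "__main__":
    # Strongly correlated ensemble: the ordered reward bound is much tighter.
    print(cholesky_ordered_bounds([1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0]))
```

In this sketch, an ensemble whose safety and reward estimates are uncorrelated yields the same reward bound as the objective-wise baseline, while a strongly correlated ensemble yields a tighter one. The safety bound is identical in both cases, which is the sense in which the ordering is safety-first.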