🤖 AI Summary
This work addresses two obstacles to balancing reward maximization and safety in constrained reinforcement learning: sharp minima in value functions, which lead to poor generalization, and inadequate modeling of heavy-tailed risk. To overcome these limitations, the authors propose a co-optimization framework that promotes critic diversity via adaptive stochastic gradient Langevin dynamics, models the full cost distribution using implicit quantile networks to optimize Conditional Value-at-Risk (CVaR), and dynamically adjusts constraint tightness through a CVaR-based reactive Lagrangian relaxation mechanism. Evaluated on the Safety-Gymnasium benchmark, the method achieves the lowest cost in 7 out of 10 tasks, reduces costs by 19%–63% in velocity tasks relative to state-of-the-art baselines, and maintains competitive returns.
📝 Abstract
Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distributions. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19%–63% in velocity tasks compared to state-of-the-art baselines.
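To make the third mechanism concrete, the following is a minimal sketch of an empirical CVaR estimate over episodic costs and a multiplier update driven by it. The abstract does not give SL-SAC's actual update rule, so everything here is an assumption for illustration: the function names `empirical_cvar` and `update_multiplier`, the tail level `alpha`, the step size `lr`, and the use of plain projected gradient ascent.

```python
import numpy as np


def empirical_cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of the costs (upper tail)."""
    sorted_costs = np.sort(np.asarray(costs, dtype=float))
    # Number of samples in the tail; the small tolerance guards against
    # floating-point round-up, and at least one sample is always kept.
    k = max(1, int(np.ceil((1.0 - alpha) * sorted_costs.size - 1e-9)))
    return float(sorted_costs[-k:].mean())


def update_multiplier(lam, costs, budget, lr=0.01, alpha=0.9):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The violation signal is CVaR(costs) - budget rather than
    mean(costs) - budget (hypothetical form, not the paper's exact rule).
    """
    violation = empirical_cvar(costs, alpha) - budget
    # Projection onto lam >= 0 keeps the multiplier valid.
    return max(0.0, lam + lr * violation)
```

As a usage example, take ten episodes in which nine incur zero cost and one incurs a cost of 100, with a budget of 25. The mean cost is 10, so an expected-cost update sees no violation, while the CVaR at `alpha=0.9` is 100 and `update_multiplier(0.0, costs, budget=25.0)` raises the multiplier. This is the sense in which a tail-based signal can react to rare, costly episodes that the mean hides.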