🤖 AI Summary
In offline reinforcement learning, fixed-strength regularization struggles to accommodate heterogeneous data quality across state-action pairs: insufficient regularization exacerbates extrapolation error and value overestimation, while excessive regularization hinders policy optimization. To address this, we propose Selective State-Adaptive Regularization (SSAR), the first framework to theoretically unify Conservative Q-Learning (CQL) and explicit policy-constraint methods. SSAR dynamically modulates trust in Bellman updates and regularization strength via state-dependent coefficients, imposing strong constraints on high-quality state-action pairs to suppress extrapolation while preserving optimization freedom in low-quality regions. The method integrates representative value regularization, policy-constraint modeling, and Bellman-error sensitivity analysis. On the D4RL benchmark, SSAR significantly outperforms prior state-of-the-art methods; moreover, it demonstrates superior generalization and stability in offline-to-online transfer tasks.
📝 Abstract
Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often regularize the value function or policy updates uniformly across all states. However, because data quality varies substantially, a fixed regularization strength leads to a dilemma: weak regularization fails to suppress extrapolation errors and value overestimation, while strong regularization shifts policy learning toward behavior cloning, impeding the performance gains enabled by Bellman updates. To address this issue, we propose a selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients that determine, per state, how much to trust Bellman-driven results, while selectively applying regularization to high-quality actions so as to avoid the performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value-regularization method, CQL, and explicit policy-constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.
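To make the core idea concrete, here is a minimal toy sketch of state-adaptive regularization weighting a CQL-style penalty. Everything here is illustrative, not the paper's implementation: the quality scores, the linear mapping `state_adaptive_coeffs`, and the scalar Q-value arrays are all assumptions; the actual SSAR coefficients and selection rule are defined in the paper.

```python
import numpy as np

def state_adaptive_coeffs(quality, alpha_max=5.0, alpha_min=0.1):
    """Map per-state data-quality scores in [0, 1] to regularization
    coefficients: high-quality states get strong regularization (trust the
    dataset actions), low-quality states get weak regularization (trust
    Bellman-driven policy improvement). The linear mapping and the bounds
    alpha_min/alpha_max are illustrative assumptions, not from the paper."""
    quality = np.clip(np.asarray(quality, dtype=float), 0.0, 1.0)
    return alpha_min + (alpha_max - alpha_min) * quality

def weighted_cql_penalty(q_policy, q_data, coeffs):
    """Toy CQL-style penalty: per state, push down the Q-value of the
    policy's action and push up the Q-value of the dataset action, with
    a state-dependent weight instead of a single fixed alpha."""
    return float(np.mean(coeffs * (np.asarray(q_policy) - np.asarray(q_data))))

# Toy example: four states whose behavior data ranges from expert-like
# (quality 0.9) to near-random (quality 0.0).
quality = np.array([0.9, 0.1, 0.5, 0.0])
coeffs = state_adaptive_coeffs(quality)
q_policy = np.array([1.2, 0.8, 1.0, 0.5])  # Q of actions from the learned policy
q_data = np.array([1.0, 1.1, 0.9, 0.9])    # Q of actions in the dataset
penalty = weighted_cql_penalty(q_policy, q_data, coeffs)
```

The penalty would be added to the standard Bellman loss; with a uniform coefficient it would constrain all four states equally, whereas here the low-quality states contribute almost nothing, leaving the policy free to improve on them.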