Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL

📅 2025-05-26
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
In offline reinforcement learning, fixed-strength regularization struggles to accommodate heterogeneous data quality across state-action pairs: insufficient regularization exacerbates extrapolation error and value overestimation, while excessive regularization hinders policy optimization. To address this, the paper proposes Selective State-Adaptive Regularization (SSAR), presented as the first framework to theoretically unify Conservative Q-Learning (CQL) and explicit policy constraint methods. SSAR modulates trust in Bellman updates via state-dependent regularization coefficients, imposing strong constraints on high-quality state-action pairs to suppress extrapolation while preserving optimization freedom in low-quality regions. The method integrates representative value regularization, policy constraint modeling, and Bellman error sensitivity analysis. On the D4RL benchmark, SSAR significantly outperforms prior state-of-the-art methods, and it demonstrates superior generalization and stability in offline-to-online transfer tasks.

📝 Abstract
Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.
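The core idea in the abstract (per-state regularization coefficients, applied selectively to high-quality dataset actions) can be sketched as a weighted CQL-style penalty. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the function name, the `state_quality` input (e.g., a normalized advantage estimate of the dataset action), and the linear weighting scheme are all assumptions for exposition.

```python
def state_adaptive_cql_penalty(q_data, q_policy, state_quality, base_alpha=1.0):
    """Sketch of a CQL-style conservative penalty with state-adaptive weights.

    q_data:        Q-values of dataset actions, one per transition.
    q_policy:      Q-values of policy-sampled actions for the same states.
    state_quality: per-state scores in [0, 1]; high means the dataset action
                   is trusted (hypothetical input, e.g. a normalized advantage).
    base_alpha:    global regularization scale.
    """
    total = 0.0
    for qd, qp, w in zip(q_data, q_policy, state_quality):
        # Standard CQL gap: push down policy-action values relative to
        # dataset-action values. The state-adaptive coefficient applies
        # strong regularization where the dataset action is high quality
        # and relaxes it (trusting Bellman updates) where it is low quality.
        total += base_alpha * w * (qp - qd)
    return total / len(q_data)
```

With all quality weights at 1 this reduces to a uniform conservative penalty; with weights at 0 the penalty vanishes and learning is driven purely by Bellman updates, which matches the selective behavior the abstract describes.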
Problem

Research questions and friction points this paper is trying to address.

Addresses extrapolation errors in offline RL
Balances regularization strength for data quality variations
Improves performance by selective state-adaptive regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

State-adaptive regularization coefficients for Bellman trust
Selective regularization on high-quality actions only
Extends CQL and policy constraint methods adaptively
👥 Authors
Qin-Wen Luo (Nanjing University of Aeronautics and Astronautics, Nanjing, China)
Ming-Kun Xie (RIKEN Center for Advanced Intelligence Project)
Ye-Wen Wang (Nanjing University of Aeronautics and Astronautics, Nanjing, China)
Sheng-Jun Huang (Nanjing University of Aeronautics and Astronautics)