🤖 AI Summary
This paper addresses the instability of value function estimation in reinforcement learning by proposing Bellman Error Centering (BEC), a unified framework which reveals that value-based reward centering (VRC) is inherently an instance of BEC. The framework systematically resolves the challenge of constructing centered fixed points for both tabular and linear value function approximation. The authors introduce the Centered TD (CTD) and Centered TD Correction (CTDC) algorithms and prove their convergence in the on-policy and off-policy settings, respectively. The BEC paradigm generalizes to a broad class of temporal-difference algorithms, enhancing training stability. Empirical evaluations show that the proposed methods outperform conventional reward centering across diverse tasks and policy distributions, delivering consistent gains without task-specific tuning.
📝 Abstract
This paper revisits the recently proposed reward centering algorithms, simple reward centering (SRC) and value-based reward centering (VRC), and shows that SRC indeed performs reward centering, whereas VRC essentially performs Bellman error centering (BEC). Based on BEC, we derive the centered fixed point for tabular value functions, as well as the centered TD fixed point for linear value function approximation. We design the on-policy CTD algorithm and the off-policy CTDC algorithm, and prove the convergence of both. Finally, we experimentally validate the stability of the proposed algorithms. Bellman error centering extends readily to a variety of reinforcement learning algorithms.
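To make the centering idea concrete, here is a minimal, hypothetical sketch of value-based centering in a tabular setting: the reward-rate estimate `r_bar` is updated from the TD error itself (the Bellman-error-centering view), rather than from raw rewards as in SRC. The two-state chain, step sizes, and variable names are illustrative assumptions for this sketch, not the paper's exact CTD algorithm.

```python
# Hypothetical sketch: tabular TD learning with a reward-rate estimate
# that is driven by the TD (Bellman) error, on a deterministic two-state
# reward cycle: s0 --r=1--> s1 --r=0--> s0. All constants are illustrative.

GAMMA = 0.99   # discount factor
ALPHA = 0.1    # value-function step size
BETA = 0.01    # (slower) step size for the reward-rate estimate

# transitions[s] = (reward, next_state)
transitions = {0: (1.0, 1), 1: (0.0, 0)}

v = [0.0, 0.0]   # value estimates
r_bar = 0.0      # reward-rate estimate, updated from the TD error

s = 0
for _ in range(20000):
    r, s_next = transitions[s]
    # TD error with the reward-rate estimate subtracted ("centering")
    delta = r - r_bar + GAMMA * v[s_next] - v[s]
    v[s] += ALPHA * delta
    r_bar += BETA * delta   # value-based update: driven by the TD error
    s = s_next

print(round(r_bar, 3), [round(x, 3) for x in v])
```

On this cycle the true average reward is 0.5; updating `r_bar` from the TD error drives it close to that value at the fixed point, while the value estimates absorb the remaining offset.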