AI Summary
This work addresses reward hacking in reinforcement learning, where the agent's proxy reward is only partially aligned with the true objective, by framing it as a robust policy optimization problem over the set of proxy rewards that satisfy an r-correlation constraint. The proposed approach uses a max-min optimization framework with occupancy measure regularization and, where the reward is linear in known features, a linear reward feature prior, so that the policy is optimized against the worst-case proxy reward. Experiments across several environments and levels of proxy-true reward correlation show that the method consistently outperforms baselines such as ORPO in worst-case return, with improved robustness and training stability. It also yields interpretable worst-case reward structures and provides stronger guarantees against broader classes of correlated proxies than methods that optimize against a single fixed proxy.
Abstract
Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.
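To make the max-min idea concrete, the following is a minimal toy sketch and not the authors' algorithm. It assumes a finite list of candidate policies (each given as an occupancy vector over state-action pairs) and a finite list of candidate rewards, and it approximates the correlation constraint with a weighted Pearson correlation against a reference reward. The names `weighted_correlation` and `robust_policy_choice`, the uniform weighting, and the toy numbers are all hypothetical choices made for illustration.

```python
import numpy as np


def weighted_correlation(a, b, weights):
    """Weighted Pearson correlation between two reward vectors (illustrative assumption)."""
    mean_a = np.sum(weights * a)
    mean_b = np.sum(weights * b)
    cov = np.sum(weights * (a - mean_a) * (b - mean_b))
    var_a = np.sum(weights * (a - mean_a) ** 2)
    var_b = np.sum(weights * (b - mean_b) ** 2)
    return cov / np.sqrt(var_a * var_b)


def robust_policy_choice(occupancies, candidate_rewards, reference_reward, weights, r):
    """Pick the policy with the best worst-case return over the feasible rewards.

    occupancies: array (n_policies, n_state_actions), one occupancy vector per policy.
    candidate_rewards: array (n_candidates, n_state_actions).
    A candidate is feasible if its correlation with reference_reward is at least r.
    """
    feasible = np.array([
        reward for reward in candidate_rewards
        if weighted_correlation(reward, reference_reward, weights) >= r
    ])
    if feasible.size == 0:
        raise ValueError("no candidate reward satisfies the correlation constraint")
    # returns[i, j] = expected return of policy i under feasible reward j
    returns = occupancies @ feasible.T
    worst_case = returns.min(axis=1)      # inner min over rewards
    best = int(np.argmax(worst_case))     # outer max over policies
    return best, worst_case


# Toy problem with three state-action pairs (hypothetical numbers).
weights = np.full(3, 1.0 / 3.0)
reference_reward = np.array([1.0, 0.0, -1.0])
candidate_rewards = np.array([
    [1.0, 0.0, -1.0],    # identical to the reference
    [0.0, 1.0, -1.0],    # partially correlated with the reference
    [-1.0, 0.0, 1.0],    # anti-correlated, excluded for any r > -1
])
occupancies = np.array([
    [1.0, 0.0, 0.0],     # commits fully to the first state-action pair
    [0.5, 0.5, 0.0],     # hedges between the first two pairs
    [0.0, 1.0, 0.0],     # commits fully to the second pair
])

loose_best, loose_worst = robust_policy_choice(
    occupancies, candidate_rewards, reference_reward, weights, r=0.4)
tight_best, tight_worst = robust_policy_choice(
    occupancies, candidate_rewards, reference_reward, weights, r=0.9)
print("r=0.4:", loose_best, loose_worst)
print("r=0.9:", tight_best, tight_worst)
```

In this toy setting, a looser correlation threshold admits more candidate rewards, so the hedging policy has the best worst-case return. A tighter threshold leaves only the reference reward feasible, and the policy that is greedy on it is selected. The method described in the abstract optimizes over a continuous set of correlated rewards rather than a finite list, so this enumeration only illustrates the structure of the objective.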