🤖 AI Summary
This work addresses the reward hacking problem in reinforcement learning from human feedback (RLHF), which arises when the policy exploits errors in the proxy reward model. The authors propose a pessimistic optimization framework grounded in distributional reward modeling, treating the reward as a random variable \( p(r \mid x, y) \). Within this framework, formulated via Bayesian inference or KL-distributionally robust optimization (KL-DRO), they show that existing heuristics for reward model ensembles (mean aggregation, worst-case optimization, and uncertainty weighting) emerge as limits or truncations of a single expression, which clarifies the assumptions behind each. The key contribution is the derivation of a closed-form pessimistic reward function \( \tilde{r}(x, y) = -\beta \log \mathbb{E}_p[e^{-r/\beta}] \), which provides both theoretical justification and practical guidance for mitigating reward hacking in RLHF systems.
📝 Abstract
Reinforcement learning from human feedback (RLHF) is bottlenecked by \emph{reward hacking}, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is \emph{pessimism}: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a \emph{distributional} reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
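The pessimistic branch can be explored numerically. The following is a minimal sketch, not the authors' implementation. It assumes that $\mathbb{E}_p$ is approximated by an empirical average over the scores of a finite RM ensemble, and the names `pessimistic_reward`, `mean_minus_variance`, and `ensemble_scores` are hypothetical. The sketch evaluates $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$ with a log-sum-exp shift for stability. It also computes the second-order truncation (mean minus variance over $2\beta$), which is one plausible reading of an uncertainty-weighted rule; the exact UWO coefficient used in prior work may differ.

```python
import math


def pessimistic_reward(rewards, beta):
    """Empirical estimate of -beta * log(mean(exp(-r / beta))).

    `rewards` is a list of scores for one (x, y) pair, e.g. from a
    hypothetical reward-model ensemble. Uses a log-sum-exp shift so that
    small beta does not overflow.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [-r / beta for r in rewards]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return -beta * (shift + math.log(mean_exp))


def mean_minus_variance(rewards, beta):
    """Second-order truncation of the expression above: mean - var / (2 * beta)."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    return mu - var / (2.0 * beta)


# Hypothetical ensemble scores for a single prompt/response pair.
ensemble_scores = [1.0, 2.0, 3.0, 0.5]

for beta in (1e-3, 1.0, 10.0, 1e4):
    print(beta, pessimistic_reward(ensemble_scores, beta),
          mean_minus_variance(ensemble_scores, beta))
```

For large $\beta$ the value approaches the ensemble mean (mean aggregation), for small $\beta$ it approaches the ensemble minimum (worst-case optimization), and in between it is close to the variance-penalized mean whenever higher-order cumulants are small relative to $\beta^2$.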