A Unifying Lens on Reward Uncertainty in RLHF

📅 2026-06-08
🤖 AI Summary
This work addresses the reward hacking problem in reinforcement learning from human feedback (RLHF), which arises due to errors in reward modeling. The authors propose a pessimistic optimization framework grounded in distributional reward modeling, treating the reward as a random variable \( p(r \mid x, y) \). Within this framework—formulated via Bayesian inference or KL-divergence distributionally robust optimization (KL-DRO)—they unify existing heuristic strategies such as mean aggregation, worst-case optimization, and uncertainty weighting, while clarifying their underlying assumptions. The key contribution is the derivation of a closed-form pessimistic reward function \( \tilde{r}(x, y) = -\beta \log \mathbb{E}_p[e^{-r/\beta}] \), which provides both theoretical justification and practical guidance for mitigating reward hacking in RLHF systems.
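One way to see how a single expression can contain several aggregation rules is to expand it in powers of \( 1/\beta \). The sketch below is a standard cumulant expansion, not a reproduction of the paper's derivation. The symbols \( \mu \) and \( \sigma^2 \) (the mean and variance of \( r \) under \( p(r \mid x, y) \)) are introduced here for illustration, and the expansion assumes the moment generating function of \( r \) is finite near \( -1/\beta \).

```latex
\begin{align*}
\tilde{r}(x, y)
  &= -\beta \log \mathbb{E}_p\!\left[ e^{-r/\beta} \right] \\
  &= -\beta \left( -\frac{\mu}{\beta} + \frac{\sigma^2}{2\beta^2} + O(\beta^{-3}) \right) \\
  &= \mu - \frac{\sigma^2}{2\beta} + O(\beta^{-2}), \\[4pt]
\lim_{\beta \to \infty} \tilde{r}(x, y) &= \mu, \qquad
\lim_{\beta \to 0^{+}} \tilde{r}(x, y) = \operatorname*{ess\,inf}_{p} \, r .
\end{align*}
```

Read this way, the large-\( \beta \) limit looks like mean aggregation, the small-\( \beta \) limit like worst-case optimization, and the second-order truncation like a variance-penalized (uncertainty-weighted) rule. Which limit the authors pair with which heuristic is an inference from the summary, not something stated in it.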
📝 Abstract
Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
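As a rough illustration of the pessimistic branch, the sketch below evaluates \( -\beta \log \mathbb{E}_p[e^{-r/\beta}] \) on a finite RM ensemble, treating the ensemble scores as equally weighted samples from \( p(r \mid x, y) \). That equal weighting, the function names `pessimistic_reward` and `uncertainty_weighted`, and the example scores are all assumptions made here for illustration; the abstract does not specify an implementation.

```python
import numpy as np


def pessimistic_reward(samples, beta):
    """Return -beta * log(mean(exp(-r / beta))) over the last axis.

    Uses the max-shift (log-sum-exp) trick so small beta does not overflow.
    `samples` holds reward draws or ensemble scores (hypothetical input).
    """
    r = np.asarray(samples, dtype=float)
    z = -r / beta
    z_max = z.max(axis=-1, keepdims=True)
    log_mean_exp = z_max.squeeze(-1) + np.log(np.mean(np.exp(z - z_max), axis=-1))
    return -beta * log_mean_exp


def uncertainty_weighted(samples, beta):
    """Second-order truncation: mean minus variance / (2 * beta)."""
    r = np.asarray(samples, dtype=float)
    return r.mean(axis=-1) - r.var(axis=-1) / (2.0 * beta)


# Hypothetical scores from a four-member RM ensemble for one (x, y) pair.
ensemble_scores = np.array([1.0, 1.2, 0.4, 0.9])

large_beta = pessimistic_reward(ensemble_scores, beta=1e4)   # close to the mean
small_beta = pessimistic_reward(ensemble_scores, beta=1e-3)  # close to the minimum
mid_beta = pessimistic_reward(ensemble_scores, beta=2.0)     # in between
mid_beta_truncated = uncertainty_weighted(ensemble_scores, beta=2.0)
```

With these inputs, `large_beta` sits near the ensemble mean, `small_beta` near the ensemble minimum, and `mid_beta` close to its variance-penalized truncation, so `beta` acts as a dial between mean-like and worst-case-like aggregation.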
Problem

Research questions and friction points this paper is trying to address.

reward hacking
reward uncertainty
RLHF
distributional reward model
pessimism
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributional reward model
reward uncertainty
KL-DRO
pessimism in RLHF
unified reward aggregation