A Unifying Lens on Reward Uncertainty in RLHF

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reward hacking problem in reinforcement learning from human feedback (RLHF), which arises due to errors in reward modeling. The authors propose a pessimistic optimization framework grounded in distributional reward modeling, treating the reward as a random variable $ p(r \mid x, y) $. Within this framework—formulated via Bayesian inference or KL-divergence distributionally robust optimization (KL-DRO)—they unify existing heuristic strategies such as mean aggregation, worst-case optimization, and uncertainty weighting, while clarifying their underlying assumptions. The key contribution is the derivation of a closed-form pessimistic reward function $ \tilde{r}(x, y) = -\beta \log \mathbb{E}_p[e^{-r/\beta}] $, which provides both theoretical justification and practical guidance for mitigating reward hacking in RLHF systems.
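The closed form above can be recovered from the Gibbs variational principle. The following is a sketch, not the paper's derivation. It assumes the KL-DRO problem is written in penalized form, with an adversarial reward distribution $q$ (a symbol introduced here for illustration) and the same temperature $\beta$. The paper may instead use a constrained form or a different parameterization.

```latex
\begin{align*}
\tilde r(x,y)
  &= \min_{q \ll p}\Big\{ \mathbb{E}_{q}[r] + \beta\,\mathrm{KL}\big(q \,\|\, p\big) \Big\} \\
  &= -\beta \log \mathbb{E}_{p}\big[e^{-r/\beta}\big],
  \qquad q^\star(r) \propto p(r \mid x,y)\, e^{-r/\beta}.
\end{align*}
```

The minimizer $q^\star$ tilts $p$ toward low rewards, so for a fixed mean, a more spread-out $p(r \mid x, y)$ yields a lower effective reward.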

📝 Abstract

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r \mid x, y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde{r}(x, y) = \pm\beta \log \mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
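The limiting behavior of the pessimistic branch can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it treats a finite RM ensemble as equally weighted samples from $p(r \mid x, y)$, the function names and the example scores are hypothetical, and the mean-minus-variance coefficient $1/(2\beta)$ comes from a second-order cumulant expansion rather than from a value quoted in the paper.

```python
import numpy as np


def pessimistic_reward(samples, beta):
    """Soft-min aggregate: -beta * log(mean(exp(-r / beta))) over the last axis."""
    r = np.asarray(samples, dtype=np.float64)
    z = -r / beta
    # Subtract the max exponent before exponentiating to avoid overflow/underflow.
    z_max = z.max(axis=-1, keepdims=True)
    log_mean_exp = z_max.squeeze(-1) + np.log(np.exp(z - z_max).mean(axis=-1))
    return -beta * log_mean_exp


def variance_penalized_reward(samples, beta):
    """Second-order truncation of the soft-min: mean - variance / (2 * beta)."""
    r = np.asarray(samples, dtype=np.float64)
    return r.mean(axis=-1) - r.var(axis=-1) / (2.0 * beta)


# Hypothetical scores from a 4-member RM ensemble for one (prompt, response) pair.
scores = np.array([1.0, 1.2, 0.8, 3.0])

large_beta = pessimistic_reward(scores, beta=1e6)    # approaches the ensemble mean
small_beta = pessimistic_reward(scores, beta=1e-3)   # approaches the ensemble minimum
mid_beta = pessimistic_reward(scores, beta=50.0)     # close to mean - var / (2 * beta)
truncated = variance_penalized_reward(scores, beta=50.0)

print(large_beta, scores.mean())
print(small_beta, scores.min())
print(mid_beta, truncated)
```

Under these assumptions, a large $\beta$ reproduces mean aggregation, a small $\beta$ reproduces the worst-case member, and intermediate values are well approximated by a variance-penalized mean of the kind used in uncertainty-weighted rules.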

Problem

Research questions and friction points this paper is trying to address.

reward hacking

reward uncertainty

RLHF

distributional reward model

pessimism

Innovation

Methods, ideas, or system contributions that make the work stand out.

distributional reward model

reward uncertainty

KL-DRO