🤖 AI Summary
In RLHF, policy evolution induces distribution shift in the generated responses, causing reward model (RM) miscalibration and subsequent reward hacking, where RM scores keep increasing while alignment with human preferences deteriorates. This work formally models RM bias through the lens of distribution shift. We propose an off-policy correction method that requires no additional human annotations: it achieves consistent RM parameter estimation via iterative importance weighting, integrating importance sampling into a joint optimization framework for reward modeling and policy-gradient updates. Evaluated on summarization and dialogue tasks, our approach significantly outperforms standard RLHF, improving the human preference alignment of the final policy without compromising RM generalization and thereby mitigating reward over-optimization.
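To make the importance-weighting idea concrete, here is a minimal PyTorch sketch of an importance-weighted Bradley-Terry loss for RM fitting. This is an illustration, not the authors' implementation (see the linked repository for that): the pairwise weight, formed as the product of per-response probability ratios, and the clipping threshold `max_weight` are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def iw_bradley_terry_loss(r_chosen, r_rejected,
                          logp_new_chosen, logp_new_rejected,
                          logp_old_chosen, logp_old_rejected,
                          max_weight=10.0):
    """Importance-weighted Bradley-Terry loss for reward-model fitting.

    All arguments are 1-D tensors over preference pairs:
      r_*        -- RM scores of the chosen/rejected responses (differentiable)
      logp_new_* -- summed token log-probs of each response under the current policy
      logp_old_* -- log-probs under the policy whose samples were labeled
    max_weight is an assumed clipping threshold for variance control.
    """
    # Weight each labeled pair by how likely the current policy is to generate
    # both of its responses, relative to the data-collecting policy.
    log_w = (logp_new_chosen + logp_new_rejected
             - logp_old_chosen - logp_old_rejected)
    w = log_w.exp().clamp(max=max_weight).detach()
    # Standard Bradley-Terry preference loss, reweighted per pair.
    per_pair = -F.logsigmoid(r_chosen - r_rejected)
    return (w * per_pair).mean()
```

With all weights equal to one, this reduces to the standard RM objective; the weights simply re-emphasize labeled pairs that resemble what the current policy would actually generate.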
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback on them, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble those seen by the RM during training, causing the RM to become inaccurate: the score given by the RM keeps increasing, while the learned behavior no longer matches human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, which in turn yields an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively applies an off-policy correction to the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets, where it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling.
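To illustrate how the iterative correction fits into the overall training loop, below is a self-contained toy sketch under strong simplifying assumptions: the "policy" is a categorical distribution over a small fixed set of responses and the RM is a learned score per response, whereas in the paper both are language models. The staging structure, weight clipping, and all hyperparameters here are illustrative guesses, not the authors' settings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = 8                              # toy "response vocabulary" size (assumed)
true_reward = torch.randn(K)       # hidden ground-truth preference scores

# Tiny stand-ins: in practice the policy is an LM and the RM is an LM head.
policy_logits = torch.zeros(K, requires_grad=True)

# Preference data is collected once, from the initial policy, and never relabeled.
data_logits = policy_logits.detach().clone()
pairs = torch.randint(K, (256, 2))
labels = (true_reward[pairs[:, 0]] > true_reward[pairs[:, 1]]).float()

for stage in range(3):             # outer OCRM-style correction stages
    # Re-fit the RM with importance weights pi_current / pi_data per pair.
    logp_new = F.log_softmax(policy_logits.detach(), dim=0)
    logp_old = F.log_softmax(data_logits, dim=0)
    log_w = (logp_new[pairs[:, 0]] + logp_new[pairs[:, 1]]
             - logp_old[pairs[:, 0]] - logp_old[pairs[:, 1]])
    w = log_w.exp().clamp(max=10.0)        # clipped for variance control (assumed)

    rm_scores = torch.zeros(K, requires_grad=True)
    rm_opt = torch.optim.Adam([rm_scores], lr=0.1)
    for _ in range(200):                   # weighted Bradley-Terry fit
        margin = rm_scores[pairs[:, 0]] - rm_scores[pairs[:, 1]]
        loss = (w * F.binary_cross_entropy_with_logits(
            margin, labels, reduction="none")).mean()
        rm_opt.zero_grad()
        loss.backward()
        rm_opt.step()

    # Policy improvement against the freshly corrected RM.
    pi_opt = torch.optim.Adam([policy_logits], lr=0.1)
    for _ in range(100):
        probs = F.softmax(policy_logits, dim=0)
        pg_loss = -(probs * rm_scores.detach()).sum()  # maximize expected RM score
        pi_opt.zero_grad()
        pg_loss.backward()
        pi_opt.step()
```

Because the weights depend only on policy log-probabilities of already-labeled responses, each stage reuses the original preference data; no new samples or annotations are needed, which is the point of the correction. In the first stage the weights are all one, so it coincides with standard reward modeling.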