🤖 AI Summary
Reward models used to align language models are susceptible to systematic biases (length preference, sycophancy, overconfidence, model-specific style, and answer ordering), which lead to reward hacking and reinforce undesirable behaviors. This work systematically measures these biases in five high-quality reward models and separates failure modes that are amenable to linear intervention from those that resist such correction. The authors propose a lightweight, mechanistic, post-hoc reward shaping method that requires only minimal labeled data. It uses linear probes to identify and correct low-complexity biases in the model's internal representations, reducing the targeted biases without compromising reward quality. The method also generalizes out of distribution.
📝 Abstract
Reward models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state of the art, we find that, despite prior work, biases related to length, sycophancy, and overconfidence persist. We also discover new biases toward model-specific styles and answer order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and requires only minimal labeled data. The method operates on model internals, is extensible to new biases, and generalizes out of distribution.
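
To make the idea of a post-hoc, model-internal correction for a low-complexity bias concrete, here is a minimal sketch on synthetic data. It is not the paper's implementation: the activations, the "length" attribute, the linear reward head, and the helper names `bias_direction`, `remove_direction`, and `linear_reward` are all hypothetical, and the probe used here is a simple cross-covariance direction chosen as a generic stand-in for whatever probe the authors fit. The sketch estimates a bias direction from a small labeled subset, projects it out of the activations, and compares how strongly the reward correlates with the bias attribute before and after.

```python
import numpy as np


def bias_direction(hidden, attribute):
    """Unit vector along the cross-covariance of activations and a bias attribute.

    For a binary attribute this is proportional to the difference of class means.
    """
    h_centered = hidden - hidden.mean(axis=0)
    a_centered = attribute - attribute.mean()
    direction = h_centered.T @ a_centered / len(attribute)
    return direction / np.linalg.norm(direction)


def remove_direction(hidden, direction):
    """Project activations onto the subspace orthogonal to `direction`."""
    return hidden - np.outer(hidden @ direction, direction)


def linear_reward(hidden, head):
    """Hypothetical linear reward head applied to the activations."""
    return hidden @ head


# Synthetic setup (all of this is assumed, purely for illustration).
rng = np.random.default_rng(0)
n, d = 2000, 16
length = rng.normal(size=n)    # standardized response length (the bias attribute)
quality = rng.normal(size=n)   # the signal the reward should track
hidden = rng.normal(scale=0.1, size=(n, d))
hidden[:, 0] += quality        # one coordinate encodes quality
hidden[:, 1] += length         # another coordinate encodes length

head = np.zeros(d)
head[0] = 1.0                  # reward uses quality ...
head[1] = 0.5                  # ... but also leaks length (spurious correlation)

# Fit the direction on a small labeled subset, then apply it to all examples.
labeled = slice(0, 200)
direction = bias_direction(hidden[labeled], length[labeled])

raw = linear_reward(hidden, head)
shaped = linear_reward(remove_direction(hidden, direction), head)

corr_raw_length = np.corrcoef(raw, length)[0, 1]
corr_shaped_length = np.corrcoef(shaped, length)[0, 1]
corr_shaped_quality = np.corrcoef(shaped, quality)[0, 1]

print(f"corr(raw reward, length)       = {corr_raw_length:.3f}")
print(f"corr(shaped reward, length)    = {corr_shaped_length:.3f}")
print(f"corr(shaped reward, quality)   = {corr_shaped_quality:.3f}")
```

Projecting out the cross-covariance direction removes the linear correlation between the activations and the attribute exactly on the data used to fit it, so any linear head applied afterward inherits that property in-sample; on held-out examples the correlation is only approximately removed. Biases that are not linearly encoded would not be addressed by a projection of this kind, which matches the abstract's restriction of the intervention to low-complexity biases.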