AI Summary
Reward models used for aligning large language models are vulnerable to reward hacking and lack robustness. This work addresses this issue by modeling reward hacking as a multidimensional subspace structure within the residual stream, a perspective introduced for the first time. The authors propose a training-free method that edits the reward head vector by identifying relevant residual directions through comparison between gold-standard and hacked samples, constructing a corresponding subspace, and projecting out components of the reward head aligned with this subspace. Evaluated across eight mainstream reward models, the approach significantly enhances resistance to reward hacking, outperforming fine-tuning baselines while preserving the models' original general-purpose performance.
Abstract
Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, which contains 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual-stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head's sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space structure than by isolated surface cues.
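The projection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the residual-stream directions are extracted, so the sketch assumes one direction per hacking subcategory, taken as the mean activation difference between hacked and gold responses. The function names, the synthetic activations, and the dimensions are all hypothetical.

```python
import numpy as np

def subcategory_direction(gold, hacked):
    """Assumed direction estimate: mean (hacked - gold) activation difference."""
    return (hacked - gold).mean(axis=0)

def hacking_basis(directions):
    """Orthonormal basis (d, k) for the span of k directions, via reduced QR.

    Assumes the directions are linearly independent.
    """
    q, _ = np.linalg.qr(np.stack(directions, axis=1))
    return q

def edit_reward_head(w, basis):
    """Remove from w its component inside span(basis): w - B (B^T w)."""
    return w - basis @ (basis.T @ w)

# Synthetic stand-ins for residual-stream activations (illustration only).
rng = np.random.default_rng(0)
d, n_pairs, n_subcats = 64, 32, 3
directions = []
for _ in range(n_subcats):
    gold = rng.normal(size=(n_pairs, d))
    shift = rng.normal(size=d)  # hypothetical hacking feature for this subcategory
    hacked = gold + shift + 0.1 * rng.normal(size=(n_pairs, d))
    directions.append(subcategory_direction(gold, hacked))

basis = hacking_basis(directions)
w = rng.normal(size=d)  # stand-in for a scalar reward head's weight vector
w_edited = edit_reward_head(w, basis)

# The edited head no longer responds to movement along any hacking direction.
print(np.abs(basis.T @ w_edited).max())
```

Because the edit is a single orthogonal projection of the head vector, it needs no gradient updates, and the reward assigned to any activation component orthogonal to the subspace is unchanged. That is consistent with the abstract's claim that general capability is preserved, though this sketch does not test that claim.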