HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reward models used for aligning large language models are vulnerable to reward hacking and lack robustness. This work addresses this issue by modeling reward hacking as a multidimensional subspace structure within the residual stream—a perspective introduced for the first time. The authors propose a training-free method that edits the reward head vector by identifying relevant residual directions through comparison between gold-standard and hacked samples, constructing a corresponding subspace, and projecting out components of the reward head aligned with this subspace. Evaluated across eight mainstream reward models, the approach significantly enhances resistance to reward hacking, outperforming fine-tuning baselines while preserving the models’ original general-purpose performance.
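The projection step described in the summary can be written out as a short worked identity. This is a sketch under assumptions: the symbols $w$, $U$, $h$, and $k$ are our own notation, not taken from the paper, and we assume the reward is a linear readout of the final residual state and that the subspace basis has been orthonormalized.

```latex
% Assumed notation (ours, not the paper's):
%   w \in \mathbb{R}^d            reward-head vector, with reward r(h) = w^\top h + b
%   U \in \mathbb{R}^{d \times k} orthonormal basis of the hacking subspace (U^\top U = I_k)
\begin{aligned}
w' &= (I - U U^\top)\, w
   && \text{(remove the component of } w \text{ inside the subspace)} \\
w'^\top (U a) &= w^\top (I - U U^\top)\, U a
   = w^\top (U - U)\, a = 0
   && \text{for every } a \in \mathbb{R}^k \\
r'(h + U a) &= w'^\top h + w'^\top U a + b = r'(h)
   && \text{(shifts along the subspace leave the edited reward unchanged)}
\end{aligned}
```

Under these assumptions, the edited head is insensitive to any movement of the residual state within the hacking subspace, while directions orthogonal to it are scored exactly as before.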

📝 Abstract

Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, which contains 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual-stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head's sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models' general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space structure than by isolated surface cues.
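The editing procedure in the abstract can be illustrated with a minimal NumPy sketch on synthetic data. It is not the authors' implementation. The function names (`subcategory_directions`, `orthonormal_basis`, `edit_reward_head`, `demo`) are hypothetical, and two choices are assumptions the abstract does not confirm: using one mean hacked-minus-gold difference per subcategory as that subcategory's direction, and orthonormalizing the stacked directions with an SVD.

```python
import numpy as np


def subcategory_directions(gold, hacked, labels):
    """One mean (hacked - gold) residual difference per hacking subcategory.

    gold, hacked: (n, d) residual-stream activations for paired examples.
    labels: length-n subcategory ids. Returns an (m, d) array.
    Assumption: the mean difference is used as each subcategory's direction.
    """
    labels = np.asarray(labels)
    diffs = hacked - gold
    return np.stack([diffs[labels == c].mean(axis=0) for c in np.unique(labels)])


def orthonormal_basis(directions, rel_tol=1e-6):
    """Orthonormal basis (d, k) for the span of the rows of `directions`."""
    u, s, _ = np.linalg.svd(directions.T, full_matrices=False)
    keep = s > rel_tol * s[0]  # drop numerically redundant directions
    return u[:, keep]


def edit_reward_head(w, basis):
    """Remove the component of the reward-head vector w lying in span(basis)."""
    return w - basis @ (basis.T @ w)


def demo(seed=0, d=32, n_per_cat=20):
    """Synthetic check: two planted hacking directions, one per subcategory."""
    rng = np.random.default_rng(seed)
    true_dirs = rng.normal(size=(2, d))
    labels = np.repeat([0, 1], n_per_cat)
    gold = rng.normal(size=(2 * n_per_cat, d))
    hacked = gold + 3.0 * true_dirs[labels] + 0.1 * rng.normal(size=gold.shape)
    # A head that (undesirably) rewards the planted hacking features.
    w = rng.normal(size=d) + true_dirs.sum(axis=0)

    basis = orthonormal_basis(subcategory_directions(gold, hacked, labels))
    w_edit = edit_reward_head(w, basis)

    # Mean reward advantage of hacked over gold responses, before and after.
    gap_before = float(np.mean((hacked - gold) @ w))
    gap_after = float(np.mean((hacked - gold) @ w_edit))
    return basis, w, w_edit, gap_before, gap_after


if __name__ == "__main__":
    _, _, _, before, after = demo()
    print(f"mean hacked-minus-gold reward gap: before={before:.3f}, after={after:.2e}")
```

In this toy setting no gradient step is taken: the only change is to the head vector, which matches the training-free framing. Whether general capability is preserved on a real reward model depends on how much of the head's useful signal overlaps the estimated subspace, which this sketch does not measure.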

Problem

Research questions and friction points this paper is trying to address.

reward hacking

reward models

robustness

large language models

alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking

reward model robustness

vector editing