🤖 AI Summary
Mainstream reward models (RMs) rely on unidirectional causal attention and Siamese encoding, which prevent fine-grained token-level interaction between prompts and responses, leaving them vulnerable to "attention hacking" and undermining the stability of their preference judgments.
Method: We first identify this architectural limitation and propose an interaction distillation framework: a teacher model with full global attention (e.g., an interaction-based NLU model) guides a decoder-only student RM to learn both intra- and inter-sequence fine-grained attention patterns, enforced via an attention alignment loss for effective knowledge transfer (a minimal sketch of such a loss follows this summary).
Contribution/Results: Experiments across multiple benchmarks demonstrate substantial improvements in reward-signal stability and generalization, outperforming state-of-the-art RM optimization methods that target data noise and effectively mitigating attention hacking in reward modeling.
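To make the attention-alignment idea concrete, here is a minimal PyTorch sketch of how such a distillation term could be combined with a standard Bradley-Terry preference loss. It assumes both teacher and student expose attention maps of shape (batch, heads, seq, seq) over the same prompt-response tokenization; the function names, the KL formulation, and the weighting factor are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an attention-alignment distillation objective for a
# decoder-only reward model. All names and the specific loss form are
# assumptions for illustration, not the paper's released code.
import torch
import torch.nn.functional as F


def attention_alignment_loss(student_attn: torch.Tensor,
                             teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence from the teacher's full (bidirectional) attention,
    re-normalized over causally visible positions, to the student's
    causal attention. Both inputs: (batch, heads, seq, seq)."""
    seq_len = student_attn.size(-1)
    visible = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                    device=student_attn.device))
    # Restrict the teacher map to positions the causal student can attend to,
    # then re-normalize each row into a valid distribution.
    teacher = teacher_attn.masked_fill(~visible, 0.0)
    teacher = teacher / teacher.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    kl = teacher * (teacher.clamp_min(1e-9).log()
                    - student_attn.clamp_min(1e-9).log())
    return kl.masked_fill(~visible, 0.0).sum(dim=-1).mean()


def interaction_distillation_loss(reward_chosen: torch.Tensor,
                                  reward_rejected: torch.Tensor,
                                  student_attn: torch.Tensor,
                                  teacher_attn: torch.Tensor,
                                  align_weight: float = 0.1) -> torch.Tensor:
    """Bradley-Terry preference loss plus the attention-alignment term."""
    preference = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    alignment = attention_alignment_loss(student_attn, teacher_attn)
    return preference + align_weight * alignment
```

In practice the alignment term would likely be averaged over a chosen set of layers and heads; the sketch uses a single attention map per model for brevity.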
📝 Abstract
The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making its judgment signals vulnerable to hacking through misallocated attention to context. This stems from two fundamental limitations: (1) current preference modeling employs decoder-only architectures, whose unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence; (2) the independent Siamese-encoding paradigm eliminates token-level inter-sequence attention between the chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher, which provides sophisticated token-interaction patterns via comprehensive global attention, and guides the preference model to simulate the teacher's interaction pattern through an attentional alignment objective. Extensive experiments demonstrate that interaction distillation provides more stable and generalizable reward signals than state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation of RMs.
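As a toy illustration of the first limitation, the following plain-PyTorch snippet contrasts the student's causal self-attention with the teacher's full attention over the same prompt-response sequence. The dimensions, the single shared attention layer, and the variable names are assumptions for illustration only, not the architecture used in the paper.

```python
# Toy contrast between causal (decoder-only student) and full (interaction-based
# teacher) attention over a prompt+response sequence. Sizes are arbitrary.
import torch
import torch.nn as nn

embed_dim, num_heads, prompt_len, response_len = 64, 4, 6, 4
seq_len = prompt_len + response_len
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
tokens = torch.randn(1, seq_len, embed_dim)  # stand-in for prompt+response embeddings

# Decoder-only student: a causal mask blocks prompt tokens from attending
# to response tokens, so prompt-to-response interaction is absent.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
_, student_weights = attn(tokens, tokens, tokens, attn_mask=causal_mask,
                          need_weights=True, average_attn_weights=True)

# Interaction-based teacher: no mask, so prompt and response tokens attend to
# each other in both directions, yielding the fine-grained interaction pattern
# the student is distilled toward.
_, teacher_weights = attn(tokens, tokens, tokens,
                          need_weights=True, average_attn_weights=True)

# Attention mass flowing from prompt tokens to response tokens:
# zero for the causal student, non-zero for the bidirectional teacher.
print(student_weights[0, :prompt_len, prompt_len:].sum().item())  # ~0.0
print(teacher_weights[0, :prompt_len, prompt_len:].sum().item())  # > 0
```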