🤖 AI Summary
Mainstream reward models (RMs) rely on unidirectional causal attention and Siamese encoding, which prevent fine-grained token-level interaction between prompts and responses, leaving them vulnerable to "attention hacking" and undermining the stability of their preference judgments.
Method: We first identify this architectural limitation and propose an interaction distillation framework: a teacher model with full global attention (e.g., an interaction-based NLU model) guides a decoder-only student RM to learn both intra- and inter-sequence fine-grained attention patterns, enforced via an attention alignment loss for effective knowledge transfer (a minimal sketch of such a loss follows this summary).
Contribution/Results: Experiments across multiple benchmarks demonstrate substantial improvements in reward-signal stability and generalization, outperforming state-of-the-art RM optimization methods that target data noise and effectively mitigating attention hacking in reward modeling.
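To make the attention-alignment idea concrete, here is a minimal PyTorch sketch of how such a distillation term could be combined with a standard Bradley-Terry preference loss. It assumes both teacher and student expose attention maps of shape (batch, heads, seq, seq) over the same prompt-response tokenization; the function names, the KL formulation, and the weighting factor are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an attention-alignment distillation objective for a
# decoder-only reward model. All names and the specific loss form are
# assumptions for illustration, not the paper's released code.
import torch
import torch.nn.functional as F


def attention_alignment_loss(student_attn: torch.Tensor,
                             teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence from the teacher's full (bidirectional) attention,
    re-normalized over causally visible positions, to the student's
    causal attention. Both inputs: (batch, heads, seq, seq)."""
    seq_len = student_attn.size(-1)
    visible = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                    device=student_attn.device))
    # Restrict the teacher map to positions the causal student can attend to,
    # then re-normalize each row into a valid distribution.
    teacher = teacher_attn.masked_fill(~visible, 0.0)
    teacher = teacher / teacher.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    kl = teacher * (teacher.clamp_min(1e-9).log()
                    - student_attn.clamp_min(1e-9).log())
    return kl.masked_fill(~visible, 0.0).sum(dim=-1).mean()


def interaction_distillation_loss(reward_chosen: torch.Tensor,
                                  reward_rejected: torch.Tensor,
                                  student_attn: torch.Tensor,
                                  teacher_attn: torch.Tensor,
                                  align_weight: float = 0.1) -> torch.Tensor:
    """Bradley-Terry preference loss plus the attention-alignment term."""
    preference = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    alignment = attention_alignment_loss(student_attn, teacher_attn)
    return preference + align_weight * alignment
```

In practice the alignment term would likely be averaged over a chosen set of layers and heads; the sketch uses a single attention map per model for brevity.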
📝 Abstract
The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), is responsible for providing reward signals to generated responses. However, mainstream preference modeling in RMs is inadequate in terms of token-level interaction, making its judgment signals vulnerable to hacking through misallocated attention to context. This stems from two fundamental limitations: (1) current preference modeling employs decoder-only architectures, whose unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence; (2) the independent Siamese-encoding paradigm eliminates token-level inter-sequence attention between the chosen and rejected sequences. To address this "attention hacking", we propose "Interaction Distillation", a novel training framework for more adequate preference modeling through attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher, which provides sophisticated token-interaction patterns via comprehensive global attention, and guides the preference model to simulate the teacher's interaction pattern through an attentional alignment objective. Extensive experiments demonstrate that interaction distillation provides more stable and generalizable reward signals than state-of-the-art RM optimization methods that target data noise, highlighting that attention hacking constitutes a more fundamental limitation of RMs.
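As a toy illustration of the first limitation, the following plain-PyTorch snippet contrasts the student's causal self-attention with the teacher's full attention over the same prompt-response sequence. The dimensions, the single shared attention layer, and the variable names are assumptions for illustration only, not the architecture used in the paper.

```python
# Toy contrast between causal (decoder-only student) and full (interaction-based
# teacher) attention over a prompt+response sequence. Sizes are arbitrary.
import torch
import torch.nn as nn

embed_dim, num_heads, prompt_len, response_len = 64, 4, 6, 4
seq_len = prompt_len + response_len
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
tokens = torch.randn(1, seq_len, embed_dim)  # stand-in for prompt+response embeddings

# Decoder-only student: a causal mask blocks prompt tokens from attending
# to response tokens, so prompt-to-response interaction is absent.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
_, student_weights = attn(tokens, tokens, tokens, attn_mask=causal_mask,
                          need_weights=True, average_attn_weights=True)

# Interaction-based teacher: no mask, so prompt and response tokens attend to
# each other in both directions, yielding the fine-grained interaction pattern
# the student is distilled toward.
_, teacher_weights = attn(tokens, tokens, tokens,
                          need_weights=True, average_attn_weights=True)

# Attention mass flowing from prompt tokens to response tokens:
# zero for the causal student, non-zero for the bidirectional teacher.
print(student_weights[0, :prompt_len, prompt_len:].sum().item())  # ~0.0
print(teacher_weights[0, :prompt_len, prompt_len:].sum().item())  # > 0
```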