Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited contextual understanding and poor out-of-distribution (OOD) generalization of LM-based judging reward models. To this end, we propose a unified framework that integrates natural language inference (NLI) with reward modeling. Our core innovation is an interpretable slot-prediction mechanism built on a two-stage masked language modeling framework, which explicitly encodes contextual explanations into structured semantic slots, thereby enhancing the model's capacity for deep semantic reasoning in preference judgment. This design improves both the stability of reward signals and OOD generalization. Empirical results demonstrate that our method consistently outperforms state-of-the-art generative reward models across RLHF benchmarks and diverse OOD settings. By grounding reward estimation in explainable, structured semantics, our approach establishes a paradigm for trustworthy and robust alignment learning.
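A minimal sketch of the slot-prediction idea the summary describes: a masked LM fills a single verdict slot in a template that bundles the prompt, the response, and a contextual explanation, and the probability mass on a positive verbalizer serves as the reward. The checkpoint (roberta-large), the template wording, and the Yes/No verbalizers below are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch: slot-prediction reward scoring with a masked LM.
# Checkpoint, template, and verbalizers are assumptions for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-large")

def slot_reward(prompt: str, response: str, explanation: str) -> float:
    # One verdict slot; the MLM fills it conditioned on the explanation.
    template = (
        f"Question: {prompt} Answer: {response} "
        f"Explanation: {explanation} "
        f"Is the answer acceptable? {fill_mask.tokenizer.mask_token}."
    )
    # Restrict the fill to the two verbalizer tokens and normalize.
    scores = {d["token_str"].strip(): d["score"]
              for d in fill_mask(template, targets=[" Yes", " No"])}
    yes, no = scores.get("Yes", 0.0), scores.get("No", 0.0)
    return yes / (yes + no + 1e-9)  # positive-verdict probability as reward
```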

📝 Abstract
The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that slot-prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance than mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that uses an explanation-based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals than generative reward models.
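The abstract's key reframing, that judging a response is formally an NLI decision, can be made concrete with an off-the-shelf NLI classifier: treat the prompt-response pair as the premise, a quality claim as the hypothesis, and read the entailment probability as the reward. This is a hedged sketch of the formal correspondence only; the checkpoint (roberta-large-mnli) and the hypothesis wording are assumptions, not the paper's method.

```python
# Hedged sketch: reward judging cast as NLI (premise vs. hypothesis).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "roberta-large-mnli"  # generic NLI checkpoint, not the paper's model
tok = AutoTokenizer.from_pretrained(NAME)
nli = AutoModelForSequenceClassification.from_pretrained(NAME).eval()

def nli_judge(prompt: str, response: str) -> float:
    premise = f"Question: {prompt} Answer: {response}"
    hypothesis = "The answer correctly and helpfully addresses the question."
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1).squeeze(0)
    # roberta-large-mnli label order: contradiction, neutral, entailment.
    return probs[2].item()  # entailment probability as the reward signal
```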
Problem

Research questions and friction points this paper is trying to address.

Scaling the comprehension boundaries of reward models
Improving LM-based judging reward modeling by exploiting its formal consistency with NLI
Developing stable, generalizable reward signals for RLHF and OOD scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using masked language models for reward modeling
Integrating contextual explanations into slot prediction
Two-stage framework for stable reward signals (see the sketch after this list)
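A hedged end-to-end sketch of the two-stage idea named above: stage one drafts a contextual explanation with a generative LM, and stage two converts that explanation into a reward through the MLM slot predictor (slot_reward from the earlier sketch). The generator checkpoint (gpt2 as a stand-in), the prompt wording, and the decoding settings are assumptions for illustration.

```python
# Hedged two-stage sketch: explain first, then slot-predict the verdict.
from transformers import pipeline

explainer = pipeline("text-generation", model="gpt2")  # stand-in generator

def esfp_style_reward(prompt: str, response: str) -> float:
    # Stage 1: generate a contextual explanation of the response's quality.
    seed = (f"Question: {prompt}\nAnswer: {response}\n"
            f"Explanation of the answer's quality:")
    explanation = explainer(seed, max_new_tokens=64, do_sample=False,
                            return_full_text=False)[0]["generated_text"]
    # Stage 2: the masked LM fills the verdict slot conditioned on the
    # explanation; its normalized probability is the reward signal.
    return slot_reward(prompt, response, explanation.strip())
```

Conditioning the slot prediction on an explicit explanation, rather than scoring the raw pair directly, is what the abstract credits for the more stable and generalizable reward signal.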
Meiling Ning, Beijing University of Posts and Telecommunications
Zhongbao Zhang, Associate Professor, Beijing University of Posts and Telecommunications (Computer Network, Big Data, Social Network)
Junda Ye, Beijing University of Posts and Telecommunications
Jiabao Guo, Beijing University of Posts and Telecommunications
Qingyuan Guan, Beijing University of Posts and Telecommunications