🤖 AI Summary
This study systematically investigates how the choice of base large language model (LLM) affects reward model (RM) performance in reinforcement learning from human feedback (RLHF). Addressing the lack of principled guidance for base model selection, we make three key contributions: (1) empirical evidence that careful base model selection improves RM accuracy by up to 14% over the default choice; (2) a computationally efficient screening method that combines small-scale evaluations across multiple benchmarks, yielding an 18% relative improvement in identifying the top 5-10 candidate models; and (3) a data distribution estimation mechanism that substantially reduces RM performance prediction error. Through rigorous statistical analysis, cross-benchmark evaluation, and post-training validation, we establish, for the first time, a strong statistical correlation between upstream benchmark scores and downstream RLHF effectiveness. Our findings yield a reproducible, practice-oriented methodology for base model selection in RM development.
📝 Abstract
Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming increasingly challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we show a strong statistical relationship between several existing benchmarks and downstream performance. We also demonstrate that results from a small set of benchmarks can be combined to improve model selection (+18% on average for the top 5-10 models). Lastly, we illustrate the impact of different post-training steps on final performance and explore using estimated data distributions to reduce performance prediction error.
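To make the benchmark-combination idea concrete, here is a minimal sketch of one plausible screening scheme: normalize each candidate's small-scale benchmark scores per benchmark (so no single benchmark's scale dominates) and rank candidates by the mean normalized score. This is an illustration only; the model names and scores below are hypothetical, and the paper's actual aggregation procedure may differ.

```python
# Hypothetical sketch: rank candidate base models by combining scores
# from several small-scale benchmarks via min-max normalization.

def rank_candidates(scores):
    """scores: {model: {benchmark: raw_score}} -> models sorted best-first.

    Each benchmark is min-max normalized across candidates, then each
    model is ranked by the mean of its normalized scores.
    """
    benchmarks = sorted({b for per_model in scores.values() for b in per_model})
    lo = {b: min(m[b] for m in scores.values()) for b in benchmarks}
    hi = {b: max(m[b] for m in scores.values()) for b in benchmarks}

    def norm(b, x):
        # Guard against a benchmark where all candidates score the same.
        return 0.0 if hi[b] == lo[b] else (x - lo[b]) / (hi[b] - lo[b])

    combined = {
        model: sum(norm(b, s[b]) for b in benchmarks) / len(benchmarks)
        for model, s in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)

# Made-up small-scale results for three candidate base models.
scores = {
    "model-a": {"bench1": 62.0, "bench2": 36.0},
    "model-b": {"bench1": 70.0, "bench2": 40.0},
    "model-c": {"bench1": 55.0, "bench2": 45.0},
}
print(rank_candidates(scores))  # → ['model-b', 'model-c', 'model-a']
```

Here model-b wins despite not leading on every benchmark, which is the point of combining evaluations: a single benchmark can over- or under-rate a candidate, while the aggregate is a more stable predictor of downstream RM quality.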