Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines what Bradley–Terry (BT) reward learning recovers from Best-of-$N$ preference data, a setting in which both the effective reward target and the roles of $N$ and the base distribution have remained unclear. By analyzing the conditional distribution induced by Best-of-$N$ sampling, the study derives closed-form reward targets for independent-reference variants, as explicit functions of $N$ and the base distribution, and shows that these targets preserve the latent reward ranking. For the practical Best-vs-Random and Best-vs-Worst variants, exact BT representability generally fails because chosen and rejected responses come from the same candidate set, but bounded-class minimizers approach the reference targets as $N$ grows. Larger $N$ widens pairwise margins but reduces connectivity, which yields two design principles: use larger $N$ when preference labels are the bottleneck and smaller $N$ when generation is the bottleneck, and shape the base distribution to place mass between the responses whose comparison matters most at test time. Experiments on synthetic and real preference data support the predicted dependence on $N$ and on the shape of the base distribution.
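
For orientation, the standard BT model behind "BT reward learning" can be written as below. The notation is assumed here for illustration and is not taken from the paper: $r$ is the learned reward, $y_w$ and $y_l$ are the chosen and rejected responses, $\sigma$ is the logistic function, and the prompt is omitted for brevity.

$$
P(y_w \succ y_l) = \sigma\big(r(y_w) - r(y_l)\big) = \frac{\exp r(y_w)}{\exp r(y_w) + \exp r(y_l)},
\qquad
\mathcal{L}(r) = -\,\mathbb{E}_{(y_w,\, y_l)}\Big[\log \sigma\big(r(y_w) - r(y_l)\big)\Big].
$$

Under this reading, the question the summary describes is which reward function minimizes $\mathcal{L}$ when the pairs $(y_w, y_l)$ come from Best-of-$N$ sampling instead of from the BT model itself.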

📝 Abstract

Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley--Terry (BT) reward learning extracts from such data, and how to choose $N$ and the base distribution, remain unclear. We specialize a recent analysis of preference data via its induced conditional distribution to Best-of-$N$. For independent-reference variants, we derive closed-form reward targets as explicit functions of $N$ and the base distribution, and show that they preserve the latent reward ranking. For the practical Best-vs-Random and Best-vs-Worst variants, chosen and rejected responses are coupled through the same candidate set, so exact BT representability generally fails; nevertheless, bounded-class minimizers approach the reference targets as $N$ grows. Although margin and connectivity are known to govern sample efficiency in pairwise preference learning, Best-of-$N$ couples them through $N$ in opposing directions: larger $N$ widens pairwise margins but reduces connectivity. This trade-off yields two design principles: use larger $N$ when preference labels are the bottleneck, smaller $N$ when generation is the bottleneck; and shape the base distribution to place mass between the responses whose comparison matters most at test time. Experiments on synthetic and real preference data support the predicted dependence on sample size and base-distribution shape.
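
The pair construction described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's code: responses are scalars drawn from a standard normal base distribution, the latent reward is the identity, and the names `best_of_n_pair` and `mean_margin` are hypothetical. It builds Best-vs-Random and Best-vs-Worst pairs and estimates the average latent-reward margin between chosen and rejected responses.

```python
import random
import statistics


def best_of_n_pair(sample_response, reward, n, variant="random", rng=random):
    """Draw n candidates from the base distribution and return (chosen, rejected).

    variant="random": Best-vs-Random, rejected is a uniformly random non-best candidate.
    variant="worst":  Best-vs-Worst, rejected is the lowest-reward candidate.
    Both responses come from the same candidate set, so they are coupled.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    candidates = [sample_response(rng) for _ in range(n)]
    ranked = sorted(candidates, key=reward)  # ascending by latent reward
    chosen = ranked[-1]
    if variant == "worst":
        rejected = ranked[0]
    elif variant == "random":
        rejected = rng.choice(ranked[:-1])
    else:
        raise ValueError(f"unknown variant: {variant!r}")
    return chosen, rejected


def mean_margin(n, variant, trials=2000, seed=0):
    """Monte Carlo estimate of the mean reward gap between chosen and rejected.

    Toy assumption: scalar responses from N(0, 1), reward(y) = y.
    """
    rng = random.Random(seed)
    margins = []
    for _ in range(trials):
        chosen, rejected = best_of_n_pair(
            lambda r: r.gauss(0.0, 1.0), lambda y: y, n, variant, rng
        )
        margins.append(chosen - rejected)
    return statistics.fmean(margins)


if __name__ == "__main__":
    for n in (2, 4, 16):
        print(n, mean_margin(n, "random"), mean_margin(n, "worst"))
```

In this toy setting the estimated margin grows with $N$ for both variants, which matches the abstract's statement that larger $N$ widens pairwise margins. The sketch does not measure connectivity, the other side of the trade-off.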

Problem

Research questions and friction points this paper is trying to address.

Best-of-N

reward learning

preference data

Bradley-Terry model

base distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Best-of-N sampling

Bradley-Terry reward learning

preference data