AI Summary
Existing process reward models assess only the local correctness of reasoning prefixes, failing to capture their actual contribution to final problem-solving success. This work introduces the concept of prefix gain, a measure of prefix utility quantified by the improvement in solve rates achieved by a lightweight ensemble of student models when conditioned on a given prefix. Building upon this, we propose a Prefix Utility Model (PUM) that employs pairwise ranking learning to assign utility scores to complete or partial reasoning trajectories. Our approach represents the first shift from local correctness to outcome-oriented, global impact in prefix evaluation. Empirically, it substantially enhances the quality of prefix-level supervision signals in mathematical reasoning tasks, particularly under conditions of large candidate pools, increased search budgets, or sparse rule-based rewards, thereby effectively optimizing the reasoning process.
Abstract
Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
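To make the two central quantities concrete, the following is a minimal sketch, not the released implementation. It assumes prefix gain is the solve rate with the prefix minus the solve rate without it, averaged uniformly over the student models, and that the pairwise ranking objective takes a Bradley-Terry (logistic) form; the abstract states neither detail. The function names `solve_rate`, `prefix_gain`, and `pairwise_ranking_loss` are hypothetical.

```python
import math
from typing import Sequence


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of sampled completions that reached a correct final answer."""
    return sum(outcomes) / len(outcomes)


def prefix_gain(
    with_prefix: Sequence[Sequence[bool]],
    without_prefix: Sequence[Sequence[bool]],
) -> float:
    """Average solve-rate improvement across student models.

    Each argument holds one list of completion outcomes per student model.
    Uniform averaging over students is an assumption of this sketch.
    """
    gains = [
        solve_rate(w) - solve_rate(wo)
        for w, wo in zip(with_prefix, without_prefix)
    ]
    return sum(gains) / len(gains)


def pairwise_ranking_loss(score_hi: float, score_lo: float) -> float:
    """Logistic ranking loss: -log(sigmoid(score_hi - score_lo)).

    score_hi is the model's score for the prefix with the higher gain.
    The Bradley-Terry form is assumed here, not taken from the paper.
    """
    d = score_hi - score_lo
    # Numerically stable softplus(-d).
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


# Two hypothetical student models, four sampled completions each.
conditioned = [[True, True, False, True], [True, False, True, True]]
baseline = [[False, True, False, False], [False, False, True, False]]
gain = prefix_gain(conditioned, baseline)
```

Under these assumptions, a prefix with positive gain raises the students' chance of finishing the problem correctly, and the ranking loss is small when the utility model scores the higher-gain prefix above the lower-gain one.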