The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

📅 2026-06-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the false-positive bias in process reward models (PRMs): severe imbalance in step-level training data leads PRMs to over-reward incorrect but plausible-looking reasoning steps. The authors show that this bias has an asymmetric downstream impact and propose PRISM, a framework that reframes the training objective from pointwise label fitting to reliable relative ranking. PRISM uses policy-aware contrastive learning, with hard negatives generated by a temporal lookahead strategy and a difficulty-aware curriculum that adjusts the contrastive margin, all without additional annotations. In experiments on PRMBench and ProcessBench, PRISM substantially reduces false positives (by 22% on PRMBench) and improves macro F1 over strong discriminative PRM baselines. It also improves accuracy by up to 22% in guided decoding and up to 33% in Best-of-N selection, making downstream reasoning more robust.
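The effect of a false positive on Best-of-N selection can be illustrated with a toy selector. This is a minimal sketch, not the authors' implementation: the candidate data, the `best_of_n` helper, and the choice of `min` as the step-score aggregator are all assumptions made for illustration.

```python
def best_of_n(candidates, aggregate=min):
    """Pick the candidate whose aggregated step scores are highest.

    `aggregate=min` (judge a chain by its weakest step) is an assumption
    for this sketch, not something the summary specifies.
    """
    return max(candidates, key=lambda c: aggregate(c["step_scores"]))


# Hypothetical PRM scores for two sampled solutions to the same problem.
sound = {"answer": "A", "is_correct": True, "step_scores": [0.90, 0.70, 0.80]}
flawed = {"answer": "B", "is_correct": False, "step_scores": [0.90, 0.85, 0.90]}
# Step 2 of `flawed` is wrong but plausible; its 0.85 is a false positive.

biased_pick = best_of_n([sound, flawed])

# Same candidates, but the flawed step now receives a low score.
flawed_fixed = dict(flawed, step_scores=[0.90, 0.20, 0.90])
fixed_pick = best_of_n([sound, flawed_fixed])

print(biased_pick["answer"], fixed_pick["answer"])  # prints "B A"
```

In this toy setup a single over-credited step is enough for the selector to return the flawed solution, while lowering that one score restores the sound one.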
📝 Abstract
Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.
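The shift from pointwise label fitting to relative comparison can be sketched with a pairwise hinge loss over (correct step, hard negative) score pairs. The abstract does not give PRISM's exact objective, so the hinge form, the function names, the fixed margin, and the example scores below are all assumptions for illustration; the paper's curriculum over the margin is not modeled here.

```python
def margin_ranking_loss(pos_score, neg_score, margin):
    """Hinge loss on one (correct step, hard negative) pair.

    The loss is zero only when the correct step outscores the negative
    by at least `margin`. Hypothetical form, not PRISM's stated loss.
    """
    return max(0.0, margin - (pos_score - neg_score))


def mean_pairwise_loss(pairs, margin):
    """Average the hinge loss over a batch of (pos_score, neg_score) pairs."""
    return sum(margin_ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)


# Hypothetical PRM scores: (correct step, plausible-but-wrong step).
pairs = [
    (0.90, 0.30),  # well separated: contributes no loss
    (0.90, 0.80),  # negative is over-credited: gap of 0.10 is below the margin
]
loss = mean_pairwise_loss(pairs, margin=0.2)
print(round(loss, 4))
```

Only the second pair is penalized, so the training signal concentrates on negatives the model scores nearly as high as correct steps, regardless of how imbalanced the individual step labels are.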
Problem

Research questions and friction points this paper is trying to address.

Process Reward Models
hidden bias
false positives
step-level feedback
credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Models
False Positive Reduction
Contrastive Learning
Policy-Aware Training
Curriculum Learning