EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

📅 2026-05-29

🤖 AI Summary
This work addresses the lack of robustness verification in existing process reward models (PRMs) under label-preserving structural perturbations, which can lead to unstable scoring. The authors propose the EST-PRM framework to systematically evaluate PRM sensitivity to three types of transformations: step inflation, dependency-aware reordering, and confidence tagging. They introduce a vulnerability decomposition to disentangle reward inflation from losses in correctness sensitivity. Experiments on MATH-500, GSM8K, and PRMBench reveal that Math-Shepherd is most sensitive to positional perturbations, Qwen2.5-Math-PRM suffers most from step inflation, and confidence perturbations significantly degrade reward calibration. The study further proposes targeted mitigation strategies and quantifies their trade-offs between robustness coverage and false positive rates.
📝 Abstract
Process reward models (PRMs) are widely used to provide dense step-level supervision in language-model training. Pipelines that rely on them assume that PRM scores remain stable proxies for step correctness under label-preserving transformations, that is, transformations that change the structure of a reasoning chain but preserve its final answer. We argue that this assumption is not well validated: such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across models.

To address this gap, we introduce EST-PRM, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. We define a vulnerability decomposition that separates reward inflation from loss of correctness sensitivity, and evaluate five PRM-style models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench.

The results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of $0.152 \pm 0.038$ and a $32.8 \pm 4.9\%$ score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a $47.6 \pm 4.3\%$ inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. We evaluate three mitigation strategies and highlight their trade-offs between robustness coverage and false-positive rates.
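The abstract names two headline quantities, a Pearson correlation drop and a score inflation rate, without giving their formulas here. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function names, the filler sentence used for step inflation, and the `margin` threshold are all assumptions introduced for illustration.

```python
import math
import statistics
from typing import List, Sequence


def inflate_steps(steps: List[str],
                  filler: str = "Restating the previous step for clarity.") -> List[str]:
    """Hypothetical step inflation: add a content-free step after each step.

    The final answer is untouched, so the chain's correctness label is preserved.
    """
    inflated = []
    for step in steps:
        inflated.append(step)
        inflated.append(filler)
    return inflated


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_drop(orig_scores: Sequence[float],
                     pert_scores: Sequence[float],
                     labels: Sequence[float]) -> float:
    """Assumed definition: loss in score/label correlation after perturbation."""
    return pearson(orig_scores, labels) - pearson(pert_scores, labels)


def inflation_rate(orig_scores: Sequence[float],
                   pert_scores: Sequence[float],
                   margin: float = 0.0) -> float:
    """Assumed definition: fraction of chains whose score rises by more than margin."""
    inflated = sum(1 for o, p in zip(orig_scores, pert_scores) if p - o > margin)
    return inflated / len(orig_scores)


if __name__ == "__main__":
    # Toy data: 1 = correct chain, 0 = incorrect chain (not from the paper).
    labels = [1, 1, 0, 0]
    orig = [0.9, 0.8, 0.2, 0.1]
    pert = [0.9, 0.85, 0.6, 0.5]
    print("correlation drop:", correlation_drop(orig, pert, labels))
    print("inflation rate:", inflation_rate(orig, pert))
```

In the toy data the perturbation raises the scores of incorrect chains more than those of correct ones, so the inflation rate is high and the correlation with the labels falls. Reporting the two quantities separately is in the spirit of the decomposition the abstract describes, which distinguishes reward inflation from loss of correctness sensitivity.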
Problem

Research questions and friction points this paper is trying to address.

Process Reward Models
Label-preserving Transformations
Reward Stability
Correctness Sensitivity
Stress-testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

stress-testing
process reward models
label-preserving transformations
vulnerability decomposition
reward calibration