Uncertainty-Aware Step-wise Verification with Generative Reward Models

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Process reward models (PRMs) for large language models—widely used for step-level supervision in multi-step tasks such as mathematical reasoning—act as proxies for human judgement and are therefore susceptible to reward hacking and unreliable step-wise discrimination. Method: The paper proposes an uncertainty quantification (UQ) framework built on CoT Entropy, a chain-of-thought-based measure that integrates confidence estimation into the step-wise verification pipeline of generative PRMs. The approach combines generative reward scoring, process supervision, and robustness-aware calibration to enable uncertainty-aware verification. Contribution/Results: Experiments show that incorporating uncertainty estimates suppresses erroneous reward signals, improving both the discriminative accuracy and the robustness of judge-LM PRMs on mathematical reasoning benchmarks. By explicitly modeling epistemic uncertainty at each reasoning step, the method provides more reliable and interpretable process supervision for complex reasoning tasks.

📝 Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
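The core idea in the abstract—quantifying a generative PRM's uncertainty over a step verdict—can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple setup in which several chain-of-thought judgments are sampled from the verifier for one reasoning step, each reduced to a binary verdict, and the Shannon entropy of the verdict distribution is used as the uncertainty signal (the function names and the abstention threshold are illustrative):

```python
import math
from collections import Counter

def verdict_entropy(verdicts):
    """Shannon entropy (in nats) of sampled step verdicts.

    `verdicts` is a list of labels (e.g. "correct"/"incorrect")
    extracted from multiple CoT judgments sampled from a
    generative PRM. Higher entropy = more disagreement = more
    uncertainty about this step.
    """
    n = len(verdicts)
    counts = Counter(verdicts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def uncertainty_aware_verdict(verdicts, entropy_threshold=0.5):
    """Uncertainty-aware verification: return the majority verdict,
    but abstain when the sampled judgments disagree too much
    (entropy above a hypothetical threshold)."""
    if verdict_entropy(verdicts) > entropy_threshold:
        return "abstain"
    return Counter(verdicts).most_common(1)[0][0]
```

Under this sketch, a unanimous set of verdicts has zero entropy and is accepted, while a 50/50 split exceeds the threshold and triggers abstention—mirroring the paper's goal of suppressing unreliable reward signals rather than committing to them.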
Problem

Research questions and friction points this paper is trying to address.

LLMs remain unreliable on complex multi-step reasoning tasks such as mathematical problem solving
PRMs, as proxies for human judgement, suffer from reliability issues including reward hacking
Step-wise verification needs uncertainty quantification to become robust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty quantification integrated into step-wise verification with generative PRMs
CoT Entropy, a novel UQ method that outperforms existing approaches for this task
Uncertainty estimates improve the robustness of judge-LM PRMs as process supervisors