Uncertainty-Aware Step-wise Verification with Generative Reward Models

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Process reward models (PRMs) for large language models—widely used for step-level supervision in multi-step tasks such as mathematical reasoning—act as proxies for human judgement and are therefore susceptible to reward hacking and unreliable step-wise discrimination. Method: The paper proposes an uncertainty quantification (UQ) framework built on CoT Entropy, a chain-of-thought-based measure that integrates confidence estimation into the step-wise verification pipeline of generative PRMs. The approach combines generative reward scoring, process supervision, and robustness-aware calibration to enable uncertainty-aware verification. Contribution/Results: Experiments show that incorporating uncertainty estimates suppresses erroneous reward signals, improving both the discriminative accuracy and the robustness of judge-LM PRMs on mathematical reasoning benchmarks. By explicitly modeling epistemic uncertainty at each reasoning step, the method provides more reliable and interpretable process supervision for complex reasoning tasks.

📝 Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.
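The core idea in the abstract—quantifying a generative PRM's uncertainty over a step verdict—can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple setup in which several chain-of-thought judgments are sampled from the verifier for one reasoning step, each reduced to a binary verdict, and the Shannon entropy of the verdict distribution is used as the uncertainty signal (the function names and the abstention threshold are illustrative):

```python
import math
from collections import Counter

def verdict_entropy(verdicts):
    """Shannon entropy (in nats) of sampled step verdicts.

    `verdicts` is a list of labels (e.g. "correct"/"incorrect")
    extracted from multiple CoT judgments sampled from a
    generative PRM. Higher entropy = more disagreement = more
    uncertainty about this step.
    """
    n = len(verdicts)
    counts = Counter(verdicts)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def uncertainty_aware_verdict(verdicts, entropy_threshold=0.5):
    """Uncertainty-aware verification: return the majority verdict,
    but abstain when the sampled judgments disagree too much
    (entropy above a hypothetical threshold)."""
    if verdict_entropy(verdicts) > entropy_threshold:
        return "abstain"
    return Counter(verdicts).most_common(1)[0][0]
```

Under this sketch, a unanimous set of verdicts has zero entropy and is accepted, while a 50/50 split exceeds the threshold and triggers abstention—mirroring the paper's goal of suppressing unreliable reward signals rather than committing to them.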
Problem

Research questions and friction points this paper is trying to address.

LLMs remain unreliable on complex multi-step reasoning tasks such as mathematical problem solving
PRMs, as proxies for human judgement, suffer from reliability issues including reward hacking
Step-wise verification needs uncertainty quantification to become robust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty quantification integrated into step-wise verification with generative PRMs
CoT Entropy, a novel UQ method that outperforms existing approaches for this task
Uncertainty estimates improve the robustness of judge-LM PRMs as process supervisors