Confidence as a Reward: Transforming LLMs into Reward Models

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Enhancing the reasoning of large language models (LLMs) typically relies on extensive labeled data and costly supervised training. We propose Confidence-as-a-Reward (CRew), a training-free, confidence-driven reward method that uses the token-level generation confidence natively produced by LLMs as a fine-grained, interpretable reward signal. CRew integrates chain-of-thought (CoT) identification with training-free reward modeling, providing the first systematic validation of confidence as a viable metric for evaluating reasoning paths. We further introduce CRew-DPO, a self-training strategy that constructs preference pairs by jointly incorporating token-level confidence and final-answer correctness. Experiments show that CRew significantly outperforms existing training-free reward methods on MATH500 and RewardMATH, matching or exceeding most supervised reward models, and that CRew enables efficient selection of high-quality reasoning paths for data distillation.
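To make the scoring concrete, here is a minimal sketch of how a confidence-as-reward score could be computed. It assumes the simplest plausible instantiation, the mean probability the model assigns to the tokens of its final answer; the `crew_score` name and the `answer_start` index marking the answer span are illustrative assumptions, and the paper's exact aggregation may differ.

```python
# Minimal sketch of Confidence-as-a-Reward (CRew) scoring, assuming the
# simplest instantiation: mean probability assigned to the final-answer tokens.
# `answer_start` is a hypothetical index marking where the answer span begins;
# the paper's exact aggregation over token confidences may differ.
import torch

def crew_score(logits: torch.Tensor, token_ids: torch.Tensor,
               answer_start: int) -> float:
    """Score one reasoning path by the model's confidence in its final answer.

    logits:       (seq_len, vocab_size), aligned so logits[i] scored token_ids[i]
    token_ids:    (seq_len,) tokens actually generated for this path
    answer_start: index where the final-answer span begins
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-probability the model assigned to each token it actually emitted.
    chosen = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
    # Restrict to the answer span and average in probability space.
    return chosen[answer_start:].exp().mean().item()
```

With several sampled paths per question, best-of-N selection then reduces to keeping the path with the highest score; the same score can rank paths when filtering data for distillation.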

📝 Abstract
Reward models can significantly enhance the reasoning capabilities of large language models (LLMs), but they typically require extensive curated data and costly training. To mitigate these challenges, training-free approaches such as LLM-as-a-Judge leverage the intrinsic reasoning abilities of LLMs to evaluate responses, achieving promising results. Recent works have also indicated that model confidence can serve effectively as a reward metric, distinguishing between chain-of-thought (CoT) and non-CoT paths. However, the concept of using confidence as a reward has not been comprehensively studied. In this work, we systematically investigate Confidence-as-a-Reward (CRew), a simple yet powerful training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward, especially suitable for closed-ended tasks. Through extensive experiments on mathematical reasoning tasks, we demonstrate that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks, and even surpasses most trained reward models. We further identify a strong correlation between CRew scores and the actual reasoning performance of the model. Additionally, we find that CRew can effectively filter high-quality training data. Building upon these insights, we propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals. Fine-tuning with CRew-DPO further enhances the model's judging capabilities and consistently outperforms existing self-training methods.
Problem

Research questions and friction points this paper is trying to address.

Developing training-free reward models using LLM confidence scores
Enhancing mathematical reasoning without curated data or costly training
Creating self-training methods via confidence-based preference data construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses token-level confidence as a reward metric
Applies a training-free method to closed-ended tasks
Proposes the CRew-DPO self-training strategy (sketched below)
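As referenced in the last item above, here is a hedged sketch of CRew-DPO-style preference-pair construction. The paper combines token-level confidence with final-answer correctness; the specific pairing rule below (most confident correct path against least confident incorrect path) and the `Path` and `build_preference_pairs` names are assumptions made for illustration, not the authors' exact recipe.

```python
# Hedged sketch of CRew-DPO preference-pair construction. Pairs combine
# token-level confidence with final-answer correctness; the pairing rule
# below (most confident correct path vs. least confident incorrect path)
# and the Path/build_preference_pairs names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Path:
    text: str          # full chain-of-thought plus final answer
    confidence: float  # CRew score, e.g. from crew_score() above
    correct: bool      # whether the final answer matches the reference

def build_preference_pairs(paths: list[Path]) -> list[tuple[str, str]]:
    """Turn sampled paths for one question into (chosen, rejected) DPO pairs."""
    correct = sorted((p for p in paths if p.correct),
                     key=lambda p: p.confidence, reverse=True)
    incorrect = sorted((p for p in paths if not p.correct),
                       key=lambda p: p.confidence)
    # zip truncates to the smaller group, so each pair contrasts a
    # high-confidence correct path with a low-confidence incorrect one.
    return [(c.text, r.text) for c, r in zip(correct, incorrect)]
```

The resulting (chosen, rejected) pairs could then feed any standard DPO fine-tuning loop.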
He Du
Northwestern Polytechnical University
ubiquitous computing, data mining, mobile sensing
Bowen Li
Shanghai AI Laboratory
Chengxing Xie
Xidian University
Chang Gao
The Chinese University of Hong Kong
Kai Chen
Shanghai AI Laboratory
Dacheng Tao
Nanyang Technological University
artificial intelligence, machine learning, computer vision, image processing, data mining