Reward Reasoning Model

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of ambiguous reward signals in large language model alignment, particularly for complex queries where the appropriate reward is not immediately apparent, this paper proposes the Reward Reasoning Model (RRM). RRM performs chain-of-thought reasoning at inference time before emitting a final reward, adaptively allocating additional test-time compute to harder queries. It is trained with a reinforcement learning framework that fosters self-evolved reward reasoning without requiring human-annotated reasoning traces. Evaluated on reward modeling benchmarks spanning diverse domains, RRM consistently outperforms strong baselines in reward accuracy. The pretrained RRM models are publicly released on Hugging Face to facilitate reproducibility and further research.

📝 Abstract
Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.
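The abstract describes a deliberate reasoning step before the final reward is produced. A minimal sketch of that flow as a pairwise judge is shown below; the prompt wording and the `Verdict:` line convention are illustrative assumptions here, not the paper's actual prompt format.

```python
def build_judge_prompt(query: str, answer_a: str, answer_b: str) -> str:
    """Build a hypothetical judging prompt that asks the model to reason
    step by step before committing to a final verdict line."""
    return (
        "You are a reward model. Think step by step about which answer "
        "better addresses the query, then finish with a single line "
        "'Verdict: A' or 'Verdict: B'.\n\n"
        f"Query: {query}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}"
    )

def parse_verdict(reasoning_text: str):
    """Scan the model's chain-of-thought output from the end and extract
    the final verdict; returns 'A', 'B', or None if no verdict is found."""
    for line in reversed(reasoning_text.strip().splitlines()):
        if line.strip().startswith("Verdict:"):
            choice = line.split(":", 1)[1].strip().upper()
            if choice in ("A", "B"):
                return choice
    return None
```

The key design point this sketch illustrates is that the reward is read off only after the reasoning text, so longer chains of thought (more test-time compute) can precede the same final decision format.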
Problem

Research questions and friction points this paper is trying to address.

Enhancing reward model performance using test-time compute
Developing self-evolved reward reasoning without explicit training data
Improving reward accuracy adaptively for complex queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

RRMs use chain-of-thought reasoning for rewards
Reinforcement learning enables self-evolved reward reasoning
RRMs adaptively exploit test-time compute
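One simple way to exploit extra test-time compute for a pairwise reward judge is to sample several independent reasoning chains and aggregate their verdicts by majority vote. The sketch below assumes this voting scheme for illustration; the paper's abstract states only that RRMs adaptively exploit test-time compute, not this specific mechanism.

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate per-sample verdicts ('A' or 'B'); ignore malformed ones.
    Returns the majority choice, or None if no valid verdicts exist."""
    counts = Counter(v for v in verdicts if v in ("A", "B"))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Sampling more chains for harder queries and fewer for easy ones is what makes this style of scaling adaptive: compute is spent where the reward is not immediately apparent.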