Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

📅 2025-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the vulnerability of Process Reward Models (PRMs) to reward hacking and their unreliability in identifying erroneous intermediate reasoning steps, this paper proposes the Hierarchical Reward Model (HRM). HRM jointly models reasoning at two granularities, fine-grained single steps and coarse-grained spans of consecutive steps, to explicitly assess reasoning coherence and self-reflection, particularly when an earlier step is incorrect. The authors further introduce Hierarchical Node Compression (HNC), a lightweight data augmentation strategy that merges consecutive reasoning nodes in the search tree, diversifying the Monte Carlo Tree Search (MCTS) training data and enhancing label robustness at negligible computational cost. On PRM800K, HRM trained with HNC achieves superior stability and reliability compared to PRM baselines, and cross-task evaluations on MATH500 and GSM8K demonstrate improved generalization and robustness. The core contributions are (1) a hierarchical evaluation paradigm that decouples local step correctness from multi-step coherence, and (2) the HNC mechanism, which mitigates reward hacking while preserving semantic fidelity.

📝 Abstract
Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate steps. In this paper, we propose a novel reward model approach, Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at fine-grained and coarse-grained levels. HRM performs better in assessing reasoning coherence and self-reflection, particularly when the previous reasoning step is incorrect. Furthermore, to address the inefficiency of autonomously generating PRM training data via Monte Carlo Tree Search (MCTS), we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC) based on node merging (combining two consecutive reasoning steps into one step) in the tree structure. This approach diversifies MCTS results for HRM with negligible computational overhead, enhancing label robustness by introducing noise. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K confirm HRM's superior generalization and robustness across diverse reasoning tasks. The code for all experiments will be released at https://github.com/tengwang0318/hierarchial_reward_model.
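The abstract's core distinction between fine-grained (single-step) and coarse-grained (consecutive-step) evaluation can be made concrete with a small sketch. The function below is illustrative only, not the authors' code: it builds training examples at both granularities from a labeled reasoning trace, and assumes one plausible labeling rule for two-step spans (a span counts as correct when its final step is correct, so a step that recovers from an earlier error still earns reward, matching the self-reflection behavior the paper describes).

```python
def build_hrm_examples(steps, labels):
    """Construct fine- and coarse-grained training examples from one trace.

    steps:  list of reasoning-step strings
    labels: per-step correctness (1 = correct, 0 = incorrect)
    """
    examples = []
    # Fine-grained: each single step, scored in the context of its prefix.
    for i, (step, lab) in enumerate(zip(steps, labels)):
        examples.append({"context": steps[:i], "span": [step], "label": lab})
    # Coarse-grained: two-step spans of consecutive steps. Here a span is
    # labeled correct when its final step is correct, rewarding
    # self-correction (assumed labeling rule, for illustration).
    for i in range(len(steps) - 1):
        examples.append({
            "context": steps[:i],
            "span": [steps[i], steps[i + 1]],
            "label": labels[i + 1],
        })
    return examples
```

A trace of n steps thus yields n fine-grained and n-1 coarse-grained examples; the coarse spans are what let the reward model see an error followed by its correction as a single unit.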
Problem

Research questions and friction points this paper is trying to address.

PRMs suffer from reward hacking, making them unreliable at identifying the best intermediate reasoning steps.
Single-step reward signals struggle to assess reasoning coherence and self-reflection, especially after an incorrect step.
Autonomously generating PRM training data via MCTS is computationally inefficient.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reward Model (HRM) evaluates both individual and consecutive reasoning steps at fine- and coarse-grained levels.
Hierarchical Node Compression (HNC) merges consecutive tree nodes to diversify MCTS training data at negligible cost.
HRM with HNC improves stability on PRM800K and generalizes to MATH500 and GSM8K.
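HNC's node-merging idea (combining two consecutive reasoning steps into one node of the MCTS tree) can be sketched for the simple case of a single root-to-leaf chain. The function name and representation below are illustrative assumptions, not the authors' implementation: each variant merges one adjacent pair of steps, so a single MCTS rollout yields several augmented traces at essentially no extra cost.

```python
def hnc_variants(chain):
    """Given a root-to-leaf chain of reasoning steps, return every variant
    obtained by merging one adjacent pair into a single step (HNC-style
    node compression; naming and format are illustrative).
    """
    variants = []
    for i in range(len(chain) - 1):
        # Merge steps i and i+1 into one node; the rest of the chain is kept.
        merged = chain[:i] + [chain[i] + " " + chain[i + 1]] + chain[i + 2:]
        variants.append(merged)
    return variants
```

For a chain of n steps this produces n-1 variants, which is where the "diversifies MCTS results with negligible overhead" claim comes from: no new rollouts are needed, only recombinations of existing nodes.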