🤖 AI Summary
Existing process reward models (PRMs) evaluate only final responses, rendering them ill-suited for robustly supervising the long chain-of-thought intermediate reasoning traces generated by state-of-the-art reasoning models (e.g., DeepSeek-R1). To address this, we propose ReasonFlux-PRM, a trajectory-aware PRM framework that enables joint step-level and trajectory-level supervision, specifically designed for the trajectory-response dual-output paradigm. Trained on structured chain-of-thought data and integrating both offline human annotations and online preference feedback, ReasonFlux-PRM seamlessly supports model distillation, reinforcement learning policy optimization, and test-time Best-of-N scaling. On benchmarks including AIME, MATH500, and GPQA-Diamond, the 7B variant achieves average improvements of 12.1% (supervised fine-tuning), 4.5% (reinforcement learning), and 6.3% (test-time scaling). We further release a lightweight 1.5B version optimized for edge deployment.
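To make the joint supervision concrete, here is a minimal sketch of how a step-level and a trajectory-level score might be combined into one scalar reward. The function name, the mean aggregation over steps, and the weighting coefficient `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
def combined_reward(step_rewards, trajectory_reward, alpha=0.5):
    """Blend per-step scores with a whole-trajectory score.

    step_rewards: list of floats, one score per reasoning step.
    trajectory_reward: float score for the full trajectory.
    alpha: illustrative mixing weight (assumed, not from the paper).
    """
    # Aggregate step-level supervision by averaging over steps.
    step_score = sum(step_rewards) / len(step_rewards)
    # Weighted blend of step-level and trajectory-level signals.
    return alpha * step_score + (1 - alpha) * trajectory_reward
```

A blended scalar like this can then serve as a dense reward for policy optimization or as a ranking score for data selection.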
📝 Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on models' final output responses and struggle to robustly evaluate intermediate thinking trajectories, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher-quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux
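The reward-guided Best-of-N test-time scaling described in (iii) can be sketched as follows: sample N candidate trajectory-response outputs, score each with the PRM, and return the highest-scoring one. The `reward_fn` here is a stand-in for the PRM's scoring call; its signature is an assumption for illustration.

```python
def best_of_n(candidates, reward_fn):
    """Reward-guided Best-of-N selection.

    candidates: list of N candidate outputs (e.g., sampled
                trajectory-response pairs from the policy model).
    reward_fn:  callable mapping a candidate to a scalar reward
                (a stand-in for the PRM's scoring interface).
    """
    # Keep the candidate the reward model scores highest.
    return max(candidates, key=reward_fn)


# Toy usage with a hypothetical scorer that rewards longer outputs.
samples = ["short answer", "a more detailed worked solution", "ok"]
best = best_of_n(samples, reward_fn=len)
```

In practice, N trades off compute against quality: larger N gives the PRM more candidates to discriminate among at the cost of more sampling.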