Reasoning-Aware GRPO using Process Mining

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) post-training methods predominantly employ outcome-oriented reward signals, which are insufficient for optimizing multi-step reasoning in Large Reasoning Models (LRMs). Method: The authors propose PM4GRPO, the first RL framework to incorporate process mining into Group Relative Policy Optimization (GRPO). PM4GRPO defines a scalar process reward based on fitness, measuring how closely the policy model's generated reasoning trajectory aligns with a reference process derived from a pretrained teacher model, and optimizes trajectory-level consistency alongside the standard answer and format rewards. Contribution/Results: This shifts the optimization target from outcome fidelity alone to process consistency. Evaluated on five benchmark reasoning tasks, PM4GRPO significantly outperforms existing GRPO variants, improving both reasoning consistency and final-answer accuracy and demonstrating that process-aware supervision enhances multi-step reasoning capabilities.

📝 Abstract
Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
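The conformance reward described in the abstract can be pictured with a toy sketch. The paper computes process-mining fitness between the policy model's reasoning trace and a reference process distilled from the teacher; here, as a simplifying assumption, the reference is a single teacher trace of reasoning-step labels and `SequenceMatcher` stands in for a proper alignment-based fitness measure (all names and step labels are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def conformance_fitness(policy_trace, teacher_trace):
    """Toy conformance score in [0, 1]: normalized alignment between the
    policy model's sequence of reasoning-step labels and a teacher-derived
    reference trace. The paper uses process-mining fitness against a mined
    process model; SequenceMatcher.ratio() is a crude stand-in."""
    return SequenceMatcher(None, policy_trace, teacher_trace).ratio()

# Hypothetical step labels extracted from two reasoning chains.
policy  = ["restate", "decompose", "compute", "verify", "answer"]
teacher = ["restate", "decompose", "derive", "compute", "verify", "answer"]
r_process = conformance_fitness(policy, teacher)  # scalar process reward
```

A trace identical to the reference scores 1.0, and the score degrades as the policy's reasoning procedure diverges from the teacher's, which is the property the scalar process reward needs.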
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-step reasoning in large models
Addressing outcome-centric limitations in reward schemes
Measuring reasoning alignment with teacher models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process mining measures reasoning conformance with teacher model
Augmenting GRPO with reasoning-aware signals from procedure
Scalar conformance reward aligns policy with teacher reasoning
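How such a conformance term plugs into GRPO can be sketched as follows. This is a minimal illustration, not the paper's implementation: each sampled trajectory's total reward combines answer, format, and conformance terms (the weight `lam` and the reward values are hypothetical), and advantages are standardized within the sampled group, as in standard GRPO:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each trajectory's total
    reward against the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Per-trajectory reward components for a group of 4 sampled rollouts
# (illustrative numbers; `conf` is the process-mining conformance score).
lam = 0.5  # hypothetical weight on the process reward
answer = [1.0, 0.0, 1.0, 0.0]
fmt    = [0.1, 0.1, 0.0, 0.1]
conf   = [0.9, 0.4, 0.7, 0.2]
totals = [a + f + lam * c for a, f, c in zip(answer, fmt, conf)]
adv = grpo_advantages(totals)  # mean-zero advantages for the policy update
```

The design point is that two trajectories with the same final answer can now receive different advantages, because the conformance term rewards the one whose reasoning procedure tracks the teacher's.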