Reasoning-Aware GRPO using Process Mining

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) post-training methods predominantly employ outcome-oriented reward signals, which are insufficient for optimizing multi-step reasoning in Large Reasoning Models (LRMs). Method: The authors propose PM4GRPO, the first RL framework to incorporate process mining into Group Relative Policy Optimization (GRPO). PM4GRPO defines a scalar process reward based on fitness, measuring how closely the policy model's generated reasoning trajectory aligns with a reference process derived from a pretrained teacher model, and optimizes trajectory-level consistency alongside the standard answer and format rewards. Contribution/Results: This shifts the optimization target from outcome fidelity alone to process consistency. Evaluated on five benchmark reasoning tasks, PM4GRPO significantly outperforms existing GRPO variants, improving both reasoning consistency and final-answer accuracy and demonstrating that process-aware supervision enhances multi-step reasoning capabilities.

📝 Abstract
Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
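The conformance reward described in the abstract can be pictured with a toy sketch. The paper computes process-mining fitness between the policy model's reasoning trace and a reference process distilled from the teacher; here, as a simplifying assumption, the reference is a single teacher trace of reasoning-step labels and `SequenceMatcher` stands in for a proper alignment-based fitness measure (all names and step labels are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def conformance_fitness(policy_trace, teacher_trace):
    """Toy conformance score in [0, 1]: normalized alignment between the
    policy model's sequence of reasoning-step labels and a teacher-derived
    reference trace. The paper uses process-mining fitness against a mined
    process model; SequenceMatcher.ratio() is a crude stand-in."""
    return SequenceMatcher(None, policy_trace, teacher_trace).ratio()

# Hypothetical step labels extracted from two reasoning chains.
policy  = ["restate", "decompose", "compute", "verify", "answer"]
teacher = ["restate", "decompose", "derive", "compute", "verify", "answer"]
r_process = conformance_fitness(policy, teacher)  # scalar process reward
```

A trace identical to the reference scores 1.0, and the score degrades as the policy's reasoning procedure diverges from the teacher's, which is the property the scalar process reward needs.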
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-step reasoning in large models
Addressing outcome-centric limitations in reward schemes
Measuring reasoning alignment with teacher models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process mining measures reasoning conformance with teacher model
Augmenting GRPO with reasoning-aware signals from procedure
Scalar conformance reward aligns policy with teacher reasoning
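How such a conformance term plugs into GRPO can be sketched as follows. This is a minimal illustration, not the paper's implementation: each sampled trajectory's total reward combines answer, format, and conformance terms (the weight `lam` and the reward values are hypothetical), and advantages are standardized within the sampled group, as in standard GRPO:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each trajectory's total
    reward against the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Per-trajectory reward components for a group of 4 sampled rollouts
# (illustrative numbers; `conf` is the process-mining conformance score).
lam = 0.5  # hypothetical weight on the process reward
answer = [1.0, 0.0, 1.0, 0.0]
fmt    = [0.1, 0.1, 0.0, 0.1]
conf   = [0.9, 0.4, 0.7, 0.2]
totals = [a + f + lam * c for a, f, c in zip(answer, fmt, conf)]
adv = grpo_advantages(totals)  # mean-zero advantages for the policy update
```

The design point is that two trajectories with the same final answer can now receive different advantages, because the conformance term rewards the one whose reasoning procedure tracks the teacher's.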