MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and low training efficiency of Group Relative Policy Optimization (GRPO) for mathematical reasoning, both of which stem from the need to generate multiple completions per prompt at every training step. The authors propose the first diversity-aware reward reweighting mechanism integrated into GRPO, leveraging Maximal Marginal Relevance (MMR) to evaluate the semantic diversity of generated completions and dynamically reweight their rewards. This approach prioritizes more informative samples while significantly reducing redundant computation. Extensive experiments across three model scales, three GRPO variants, and five mathematical reasoning benchmarks demonstrate that the method reduces training steps by 47.9% and actual training time by 70.2% on average, while maintaining comparable peak performance.

📝 Abstract
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
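The abstract does not spell out the reweighting formula, but the core idea can be sketched as a greedy MMR pass over a group of completions: rank completions by a trade-off between their reward (relevance) and their cosine similarity to already-selected completions (redundancy), then discount each completion's reward weight by that redundancy. The function below is a minimal illustration under these assumptions; the embedding model, the weight mapping, and the trade-off parameter `lam` are hypothetical stand-ins, not the paper's actual design.

```python
import numpy as np

def mmr_diversity_weights(rewards, embeddings, lam=0.5):
    """Assign a diversity weight in [0, 1] to each completion in a group.

    Completions are visited in greedy MMR order (high reward, low cosine
    similarity to those already selected); each one's weight is discounted
    by its maximum similarity to previously selected completions, so near-
    duplicate solutions contribute a smaller learning signal.

    rewards:    (G,) scalar rewards for the group of completions
    embeddings: (G, d) completion embeddings (any scale; normalized here)
    lam:        trade-off between reward and diversity in the MMR score
    """
    G = len(rewards)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    sim = unit @ unit.T  # pairwise cosine similarities

    weights = np.ones(G)
    selected, remaining = [], list(range(G))
    while remaining:
        # MMR score: reward minus redundancy w.r.t. already-selected items
        scores = []
        for i in remaining:
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            scores.append(lam * rewards[i] - (1.0 - lam) * redundancy)
        best = remaining[int(np.argmax(scores))]
        redundancy = max((sim[best, j] for j in selected), default=0.0)
        weights[best] = 1.0 - redundancy  # diverse -> ~1, duplicate -> ~0
        selected.append(best)
        remaining.remove(best)
    return weights

# The reweighted rewards would then feed GRPO's usual group-normalized
# advantage computation, e.g.: rewards * mmr_diversity_weights(rewards, emb)
```

In this sketch an exact duplicate of an already-selected completion receives weight 0, while a completion orthogonal to everything selected so far keeps its full reward; softer discounting schedules are equally plausible.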
Problem

Research questions and friction points this paper is trying to address.

GRPO
mathematical reasoning
training efficiency
computational cost
reward reweighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximal Marginal Relevance
reward reweighting
diversity-aware training
Group Relative Policy Optimization
mathematical reasoning