🤖 AI Summary
This work addresses the high computational cost and low training efficiency of Group Relative Policy Optimization (GRPO) for mathematical reasoning, which stems from the need to generate multiple completions per training step. The authors propose the first diversity-aware reward reweighting mechanism integrated into GRPO, leveraging Maximal Marginal Relevance (MMR) to evaluate the semantic diversity of generated completions and dynamically reweight their rewards. This approach prioritizes more informative samples while significantly reducing redundant computation. Extensive experiments across three model scales, three GRPO variants, and five mathematical reasoning benchmarks demonstrate that the method reduces training steps by 47.9% and wall-clock training time by 70.2% on average, while maintaining comparable peak performance.
📝 Abstract
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
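The core idea of diversity-aware reward reweighting can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the greedy MMR ranking, the rank-based weight schedule, and the `lam` trade-off parameter are all assumptions for demonstration, and completion embeddings are presumed to come from some external sentence-embedding model.

```python
import numpy as np

def mmr_reward_weights(embeddings, rewards, lam=0.7):
    """Reweight per-completion rewards by semantic diversity using a
    greedy Maximal Marginal Relevance (MMR) ranking (illustrative sketch).

    embeddings: (G, d) array of unit-normalized completion embeddings
    rewards:    (G,) array of scalar rewards for the G completions
    lam:        trade-off between reward (relevance) and diversity
    """
    G = len(rewards)
    sim = embeddings @ embeddings.T  # cosine similarity for unit vectors
    selected, remaining = [], list(range(G))
    weights = np.zeros(G)
    for rank in range(G):
        if not selected:
            # First pick: highest-reward completion.
            scores = {i: rewards[i] for i in remaining}
        else:
            # MMR: trade reward against similarity to completions
            # already selected; near-duplicates score low.
            scores = {
                i: lam * rewards[i]
                   - (1 - lam) * max(sim[i, j] for j in selected)
                for i in remaining
            }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        # Assumed weight schedule: earlier (more diverse) picks
        # keep more of their reward.
        weights[best] = (G - rank) / G
    return weights * rewards  # diversity-reweighted rewards
```

Under this sketch, a completion that duplicates an already-selected one is ranked late and its reward is scaled down, so redundant samples contribute less to the policy update while diverse, high-reward solutions dominate.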