🤖 AI Summary
This work addresses the high computational cost and low training efficiency of Group Relative Policy Optimization (GRPO) for mathematical reasoning, which stems from the need to generate multiple completions per training step. The authors propose the first diversity-aware reward reweighting mechanism integrated into GRPO, leveraging Maximal Marginal Relevance (MMR) to evaluate the semantic diversity of generated completions and dynamically reweight their rewards. This approach prioritizes more informative samples while significantly reducing redundant computation. Extensive experiments across three model scales, three GRPO variants, and five mathematical reasoning benchmarks demonstrate that the method reduces training steps by 47.9% and wall-clock training time by 70.2% on average, while maintaining comparable peak performance.
📝 Abstract
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweight rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
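The core idea of diversity-aware reward reweighting can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the greedy MMR ranking, the rank-based weight schedule, and the `lam` trade-off parameter are all assumptions for demonstration, and completion embeddings are presumed to come from some external sentence-embedding model.

```python
import numpy as np

def mmr_reward_weights(embeddings, rewards, lam=0.7):
    """Reweight per-completion rewards by semantic diversity using a
    greedy Maximal Marginal Relevance (MMR) ranking (illustrative sketch).

    embeddings: (G, d) array of unit-normalized completion embeddings
    rewards:    (G,) array of scalar rewards for the G completions
    lam:        trade-off between reward (relevance) and diversity
    """
    G = len(rewards)
    sim = embeddings @ embeddings.T  # cosine similarity for unit vectors
    selected, remaining = [], list(range(G))
    weights = np.zeros(G)
    for rank in range(G):
        if not selected:
            # First pick: highest-reward completion.
            scores = {i: rewards[i] for i in remaining}
        else:
            # MMR: trade reward against similarity to completions
            # already selected; near-duplicates score low.
            scores = {
                i: lam * rewards[i]
                   - (1 - lam) * max(sim[i, j] for j in selected)
                for i in remaining
            }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        # Assumed weight schedule: earlier (more diverse) picks
        # keep more of their reward.
        weights[best] = (G - rank) / G
    return weights * rewards  # diversity-reweighted rewards
```

Under this sketch, a completion that duplicates an already-selected one is ranked late and its reward is scaled down, so redundant samples contribute less to the policy update while diverse, high-reward solutions dominate.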