Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

πŸ“… 2026-02-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the instability and limited scalability of GRPO-based reinforcement learning caused by policy entropy collapse. To mitigate this, the authors introduce a prompt-augmentation mechanism into the GRPO framework, leveraging multiple reasoning templates to generate diverse rollout trajectories. This alleviates entropy collapse without requiring KL regularization, improving training stability and exploration capability. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3–5 dataset achieves state-of-the-art results on standard mathematical reasoning benchmarks (AIME24, AMC, MATH500, Minerva, and OlympiadBench), reaching 45.2% average per-benchmark accuracy and 51.8% per-question accuracy, confirming the method's effectiveness and scalability.

πŸ“ Abstract
Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 45.2% average per-benchmark accuracy and 51.8% per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt-augmentation-GRPO.
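The core idea described above can be sketched in a few lines: instead of formatting every rollout in a GRPO group with one fixed template, each rollout samples its own reasoning template before generation. A minimal illustrative sketch follows; the template strings and the `augment_prompts` helper are assumptions for illustration only, not the paper's actual templates or code.

```python
import random

# Hypothetical reasoning templates (illustrative only; the paper's
# actual template set may differ in content and size).
TEMPLATES = [
    "Solve the problem step by step, then give the final answer in \\boxed{{}}.\n{question}",
    "Think carefully and reason before answering.\nProblem: {question}\nSolution:",
    "{question}\nWork through the solution in a structured way and box the final answer.",
    "Explain your reasoning in numbered steps, ending with the final answer.\n{question}",
]

def augment_prompts(question: str, group_size: int, rng: random.Random) -> list[str]:
    """Build one augmented prompt per rollout in a GRPO group.

    Sampling a template independently for each rollout makes the group's
    trajectories start from diverse instructions, which is the mechanism
    the paper credits for keeping policy entropy from collapsing.
    """
    return [rng.choice(TEMPLATES).format(question=question) for _ in range(group_size)]

# Example: eight rollouts for one question, seeded for reproducibility.
rng = random.Random(0)
prompts = augment_prompts("What is 2 + 2?", group_size=8, rng=rng)
```

Each of the eight prompts contains the same question but may carry a different instruction format; the rest of the GRPO pipeline (rollout generation, group-relative advantage computation) is unchanged.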
Problem

Research questions and friction points this paper is trying to address.

entropy collapse
reinforcement learning
mathematical reasoning
training instability
prompt diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt augmentation
GRPO
entropy collapse
mathematical reasoning
reinforcement learning
πŸ”Ž Similar Papers
No similar papers found.