GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO faces three key challenges in chain-of-thought (CoT) reinforcement learning: coupling between reasoning-step and answer gradients, sparse rewards due to limited parallel sampling, and high-variance, unstable advantage estimation. To address these, we propose GRPO-MA, whose core innovation is a **single-reasoning-path, multiple-answers (SRP-MA) generation mechanism**: for each sampled reasoning path, multiple answers are generated in parallel, decoupling reasoning-step gradients from answer gradients and substantially reducing the variance of advantage estimates. We provide theoretical analysis showing that SRP-MA mitigates both gradient coupling and reward sparsity. Empirically, GRPO-MA achieves stable performance gains across mathematical reasoning, code generation, and multimodal tasks, outperforming standard GRPO and several strong baselines, and its gains grow consistently as the number of answers sampled per reasoning path increases, demonstrating scalability and robustness.
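A minimal sketch of the multi-answer advantage idea (our notation and interface, not the paper's): for each of G sampled thoughts we draw M answers, score them, and use the per-thought mean reward, whose variance shrinks roughly as 1/M, to form the group-normalized thought advantage.

```python
import numpy as np

def grpo_ma_advantages(answer_rewards: np.ndarray, eps: float = 1e-8):
    """Toy GRPO-MA-style advantage estimation (hypothetical interface).

    answer_rewards: shape (G, M) -- rewards for M answers sampled from
    each of G thoughts for a single prompt.
    """
    # Thought-level reward: mean over that thought's M answers. For i.i.d.
    # answer rewards this estimate has variance sigma^2 / M, so larger M
    # yields a lower-variance, more stable thought advantage.
    thought_rewards = answer_rewards.mean(axis=1)                    # (G,)

    # Group-normalized thought advantage, as in standard GRPO.
    thought_adv = (thought_rewards - thought_rewards.mean()) / (
        thought_rewards.std() + eps)                                 # (G,)

    # Per-answer advantages, normalized over all G * M answers, giving
    # answer tokens a credit signal decoupled from the thought's.
    answer_adv = (answer_rewards - answer_rewards.mean()) / (
        answer_rewards.std() + eps)                                  # (G, M)
    return thought_adv, answer_adv

# Example: 4 thoughts x 8 answers each, with binary correctness rewards.
rng = np.random.default_rng(0)
p_correct = np.array([0.9, 0.6, 0.3, 0.1])[:, None]  # per-thought quality
rewards = rng.binomial(1, p_correct, size=(4, 8)).astype(float)
t_adv, a_adv = grpo_ma_advantages(rewards)
print("thought advantages:", np.round(t_adv, 2))
```

Because all M answers continue from the same thought prefix, the extra reward observations cost far less than M full rollouts, which is where the efficiency claim comes from.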

📝 Abstract
Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
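The abstract's central theoretical claim, that the variance of the thought advantage falls as the number of answers per thought grows, is an instance of the variance-of-a-mean argument; in our notation (not necessarily the paper's):

$$\hat{r}(t_i) = \frac{1}{M}\sum_{j=1}^{M} r(a_{i,j}), \qquad \operatorname{Var}\!\left[\hat{r}(t_i) \mid t_i\right] = \frac{\sigma_i^2}{M},$$

where $a_{i,1},\dots,a_{i,M}$ are answers sampled independently from thought $t_i$ and $\sigma_i^2$ is the conditional variance of the answer reward; a thought advantage built from $\hat{r}(t_i)$ inherits this $1/M$ shrinkage.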
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient coupling between thoughts and answers in reasoning models
Mitigates sparse reward signals from limited parallel sampling (see the illustration after this list)
Improves unstable advantage estimation in reinforcement learning training
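
As a back-of-the-envelope illustration of the sparsity point (ours, not the paper's): if each sampled response solves a hard prompt with probability $p$, the chance that all $G$ parallel samples fail, leaving identical rewards and therefore zero advantages, is $(1-p)^G$. For $p = 0.05$ and $G = 8$, that is $0.95^8 \approx 0.66$, so roughly two thirds of updates carry no learning signal; sampling $M$ answers per thought raises the number of reward observations to $G \cdot M$ while paying for only $G$ thought prefixes.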
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-answer generation per thought process
Reduces gradient spikes and variance
Improves training stability and efficiency
👥 Authors
Hongcheng Wang
CFCS, School of Computer Science, Peking University
Yinuo Huang
Ph.D. Candidate, University of Electronic Science and Technology of China
Wireless communications, machine learning
Sukai Wang
Agibot
Guanghui Ren
Agibot
Hao Dong
CFCS, School of Computer Science, Peking University