GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO faces three key challenges in chain-of-thought (CoT) reinforcement learning: coupling between reasoning-step and answer gradients, sparse rewards due to limited parallel sampling, and high-variance, unstable advantage estimation. To address these, we propose GRPO-MA, whose core innovation is a **single-reasoning-path, multiple-answers (SRP-MA) generation mechanism**: for each sampled reasoning path, multiple answers are generated in parallel, decoupling reasoning-step gradients from answer gradients and substantially reducing the variance of advantage estimates. We provide theoretical analysis showing that SRP-MA mitigates both gradient coupling and reward sparsity. Empirically, GRPO-MA achieves stable performance gains across mathematical reasoning, code generation, and multimodal tasks, outperforming standard GRPO and several strong baselines, and its gains grow consistently as the number of answers sampled per reasoning path increases, demonstrating scalability and robustness.
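A minimal sketch of the multi-answer advantage idea (our notation and interface, not the paper's): for each of G sampled thoughts we draw M answers, score them, and use the per-thought mean reward, whose variance shrinks roughly as 1/M, to form the group-normalized thought advantage.

```python
import numpy as np

def grpo_ma_advantages(answer_rewards: np.ndarray, eps: float = 1e-8):
    """Toy GRPO-MA-style advantage estimation (hypothetical interface).

    answer_rewards: shape (G, M) -- rewards for M answers sampled from
    each of G thoughts for a single prompt.
    """
    # Thought-level reward: mean over that thought's M answers. For i.i.d.
    # answer rewards this estimate has variance sigma^2 / M, so larger M
    # yields a lower-variance, more stable thought advantage.
    thought_rewards = answer_rewards.mean(axis=1)                    # (G,)

    # Group-normalized thought advantage, as in standard GRPO.
    thought_adv = (thought_rewards - thought_rewards.mean()) / (
        thought_rewards.std() + eps)                                 # (G,)

    # Per-answer advantages, normalized over all G * M answers, giving
    # answer tokens a credit signal decoupled from the thought's.
    answer_adv = (answer_rewards - answer_rewards.mean()) / (
        answer_rewards.std() + eps)                                  # (G, M)
    return thought_adv, answer_adv

# Example: 4 thoughts x 8 answers each, with binary correctness rewards.
rng = np.random.default_rng(0)
p_correct = np.array([0.9, 0.6, 0.3, 0.1])[:, None]  # per-thought quality
rewards = rng.binomial(1, p_correct, size=(4, 8)).astype(float)
t_adv, a_adv = grpo_ma_advantages(rewards)
print("thought advantages:", np.round(t_adv, 2))
```

Because all M answers continue from the same thought prefix, the extra reward observations cost far less than M full rollouts, which is where the efficiency claim comes from.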

📝 Abstract
Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
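The abstract's central theoretical claim, that the variance of the thought advantage falls as the number of answers per thought grows, is an instance of the variance-of-a-mean argument; in our notation (not necessarily the paper's):

$$\hat{r}(t_i) = \frac{1}{M}\sum_{j=1}^{M} r(a_{i,j}), \qquad \operatorname{Var}\!\left[\hat{r}(t_i) \mid t_i\right] = \frac{\sigma_i^2}{M},$$

where $a_{i,1},\dots,a_{i,M}$ are answers sampled independently from thought $t_i$ and $\sigma_i^2$ is the conditional variance of the answer reward; a thought advantage built from $\hat{r}(t_i)$ inherits this $1/M$ shrinkage.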
Problem

Research questions and friction points this paper is trying to address.

Addresses gradient coupling between thoughts and answers in reasoning models
Mitigates sparse reward signals from limited parallel sampling (see the illustration after this list)
Improves unstable advantage estimation in reinforcement learning training
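
As a back-of-the-envelope illustration of the sparsity point (ours, not the paper's): if each sampled response solves a hard prompt with probability $p$, the chance that all $G$ parallel samples fail, leaving identical rewards and therefore zero advantages, is $(1-p)^G$. For $p = 0.05$ and $G = 8$, that is $0.95^8 \approx 0.66$, so roughly two thirds of updates carry no learning signal; sampling $M$ answers per thought raises the number of reward observations to $G \cdot M$ while paying for only $G$ thought prefixes.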
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-answer generation per thought process
Reduces gradient spikes and variance
Improves training stability and efficiency
👥 Authors
Hongcheng Wang
CFCS, School of Computer Science, Peking University
Yinuo Huang
Ph.D. Candidate, University of Electronic Science and Technology of China
Wireless communications, machine learning
Sukai Wang
Agibot
Guanghui Ren
Agibot
Hao Dong
CFCS, School of Computer Science, Peking University