🤖 AI Summary
This work addresses the inefficiency of synchronous on-policy reinforcement learning with large groups, where a single straggling rollout can stall reward computation and parameter updates for the whole group. The authors propose Straggler-Aware Group Control (SAGC), which formulates dynamic group-size selection as an online constrained optimization problem. By adjusting the group size online based on observed rollout behavior, SAGC retains the benefits of larger groups while constraining the long-term rate of straggler events. In synchronous GRPO and DAPO training, SAGC reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. On downstream reasoning benchmarks, it is competitive with or better than the strongest static group-size baseline and often produces shorter outputs without any explicit length penalty.
📝 Abstract
Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group size online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
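The abstract does not spell out the controller, so the following is only an illustrative sketch of what an online constrained group-size selector could look like, not the SAGC algorithm itself. It assumes a generic primal-dual (Lagrangian) scheme: a hypothetical utility that grows linearly with group size, a running per-size estimate of the straggler rate, and a dual variable that rises whenever observed stalls exceed a target rate. The names `GroupSizeController`, `simulate`, `target_rate`, `step_size`, and `p_long` are all assumptions introduced for illustration.

```python
import random


class GroupSizeController:
    """Hypothetical primal-dual group-size selector (not the paper's method)."""

    def __init__(self, candidates, target_rate, step_size=0.5):
        self.candidates = sorted(candidates)
        self.target_rate = target_rate  # allowed long-run straggler rate
        self.step_size = step_size      # dual ascent step size
        self.dual = 0.0                 # Lagrange multiplier, kept >= 0
        # Running straggler-rate estimate for each candidate group size.
        self.counts = {g: 0 for g in self.candidates}
        self.rates = {g: 0.0 for g in self.candidates}

    def utility(self, g):
        # Assumed benefit of a group of size g, normalized to (0, 1].
        return g / self.candidates[-1]

    def select(self):
        # Pick the size maximizing utility minus the dual-weighted penalty.
        return max(
            self.candidates,
            key=lambda g: self.utility(g) - self.dual * self.rates[g],
        )

    def update(self, g, straggled):
        x = 1.0 if straggled else 0.0
        # Incremental mean of the straggler indicator for size g.
        self.counts[g] += 1
        self.rates[g] += (x - self.rates[g]) / self.counts[g]
        # Dual ascent on the constraint violation, projected onto [0, inf).
        self.dual = max(0.0, self.dual + self.step_size * (x - self.target_rate))


def simulate(steps=2000, p_long=0.02, seed=0):
    """Toy loop: each rollout is 'long' independently with probability p_long."""
    rng = random.Random(seed)
    ctrl = GroupSizeController([4, 8, 16, 32], target_rate=0.2, step_size=0.05)
    stalls = 0
    for _ in range(steps):
        g = ctrl.select()
        # A synchronous step stalls if any rollout in the group is long.
        straggled = any(rng.random() < p_long for _ in range(g))
        ctrl.update(g, straggled)
        stalls += straggled
    return stalls / steps, ctrl
```

Calling `simulate()` returns the empirical stall rate and the final controller state. The design choice worth noting is that the penalty is driven by the long-run constraint violation rather than by any single step, which matches the abstract's framing of controlling the long-term rate of straggler events while still favoring larger groups when they are cheap.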