🤖 AI Summary
This work addresses the computational cost of long-context reinforcement learning, where dense rollouts are expensive and static sparse rollouts often destabilize training. The authors find that sparse rollout collapse stems not from uniform degradation across tokens but from a small tail of tokens whose sparse (actor) and dense (policy) distributions diverge. To mitigate this, they propose a dynamic sparsity schedule that tracks per-token sparse-to-dense mismatch during generation and holds its tail statistic near a constant threshold. A lightweight LoRA-based distillation step (DistillSparse) then lets more aggressive sparsity reach the same mismatch threshold, yielding higher speedup without sacrificing stability. On the Qwen3 model series, the method achieves 2.0×–2.4× rollout speedups on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, and the thresholds generalize to a larger model (Qwen3-14B) and to coding tasks.
📝 Abstract
Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long chains of thought (CoT), making it computationally expensive. Since the per-step cost of RLVR is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollouts. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with their dense counterparts even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. To validate this hypothesis, we introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule that maximizes speedup under this mismatch threshold, achieving 2.2×, 2.4×, and 2.0× rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively. Empirically, we show that the thresholds generalize to a larger model (Qwen3-14B) and to another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollouts lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.
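The tail-statistic idea can be illustrated with a minimal Python sketch. The abstract does not define the mismatch statistic, so everything concrete here is an assumption made for illustration: mismatch is taken to be the per-token probability ratio p_dense / p_sparse of the sampled token, the "lower tail" is a nearest-rank quantile of those ratios, and the function names, the quantile level, and the attention budgets are hypothetical rather than the authors' implementation.

```python
import math


def lower_tail_ratio(sparse_logprobs, dense_logprobs, q=0.05):
    """Nearest-rank q-quantile of per-token ratios p_dense / p_sparse.

    Assumption: the ratio of the sampled token's probability under the dense
    policy to its probability under the sparse actor stands in for the paper's
    per-token mismatch measure, which the abstract leaves unspecified.
    """
    if len(sparse_logprobs) != len(dense_logprobs) or not sparse_logprobs:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = sorted(
        math.exp(d - s) for s, d in zip(sparse_logprobs, dense_logprobs)
    )
    # Nearest-rank quantile: the smallest ratio with at least a q fraction
    # of all ratios at or below it.
    rank = max(1, math.ceil(q * len(ratios)))
    return ratios[rank - 1]


def most_aggressive_stable_budget(tail_by_budget, threshold):
    """Smallest attention budget whose tail statistic is at or above threshold.

    `tail_by_budget` maps a hypothetical sparse-attention budget (for example,
    number of attended tokens) to the tail statistic measured at that budget.
    Returns None when no budget satisfies the threshold.
    """
    stable = [b for b, t in tail_by_budget.items() if t >= threshold]
    return min(stable) if stable else None


if __name__ == "__main__":
    # Nine tokens agree exactly; one token is much less likely under dense.
    sparse = [math.log(0.5)] * 9 + [math.log(0.8)]
    dense = [math.log(0.5)] * 9 + [math.log(0.2)]
    print(lower_tail_ratio(sparse, dense, q=0.1))
    print(most_aggressive_stable_budget({256: 0.1, 512: 0.6, 1024: 0.9}, 0.5))
```

In the toy data, nine of ten tokens match exactly, so an average over tokens looks nearly healthy while the lower tail exposes the one badly mismatched token. This mirrors the abstract's observation that collapse is not driven by uniform degradation. The budget selection is only a stand-in for the paper's cost model, which is not described in the abstract.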