🤖 AI Summary
This work addresses inefficient sample use in reinforcement learning-based post-training of large language models, where a fixed rollout budget per prompt ignores how much the training signal varies across prompts. The authors propose CERO, which combines a Bayesian variance estimate with a Fenchel-dual reformulation and frames adaptive rollout allocation as an online resource allocation problem under a fixed global budget. CERO maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance to build a concave, saturating utility over cumulative allocations. It then updates prompt-level and budget-level dual variables by projected online gradient descent, which couples allocation decisions across prompts and epochs. Under fixed prompt utilities, the method has an $O(\sqrt{K})$ regret bound against the offline allocation benchmark, and in experiments on mathematical reasoning benchmarks it consistently outperforms GRPO across multiple open-weight LLMs.
📝 Abstract
LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
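To make the Bayesian estimate concrete, here is a minimal sketch of a Beta posterior over one prompt's success probability and its posterior expected Bernoulli variance, which has the closed form $\mathbb{E}[p(1-p)] = \frac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)}$ for $p \sim \mathrm{Beta}(\alpha, \beta)$. The class name `BetaPosterior`, its method names, and the uniform Beta(1, 1) prior are illustrative assumptions, not details taken from the paper, and the sketch does not cover the utility construction or the dual updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief over one prompt's success probability.

    Hypothetical helper for illustration; the Beta(1, 1) prior is an assumption.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> "BetaPosterior":
        # Conjugate update: add observed rollout outcomes to the pseudo-counts.
        return BetaPosterior(self.alpha + successes, self.beta + failures)

    def mean(self) -> float:
        # Posterior mean of the success probability p.
        return self.alpha / (self.alpha + self.beta)

    def expected_bernoulli_variance(self) -> float:
        # E[p(1 - p)] under Beta(alpha, beta), in closed form.
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * (n + 1.0))


if __name__ == "__main__":
    prior = BetaPosterior()
    always_solved = prior.update(successes=8, failures=0)
    mixed = prior.update(successes=4, failures=4)
    print(f"always solved: {always_solved.expected_bernoulli_variance():.4f}")
    print(f"mixed results: {mixed.expected_bernoulli_variance():.4f}")
```

Under this estimate, a prompt solved in all 8 rollouts scores lower than one solved in 4 of 8. This matches the intuition that a prompt with mixed outcomes carries more training signal and is a better candidate for additional rollouts.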