🤖 AI Summary
This work addresses the inefficiencies of static resource allocation in post-training reinforcement learning, which arise from the long-tailed distribution of trajectory lengths, asymmetric resource demands between the training and rollout phases, and drift in the trajectory-length distribution as the policy evolves. To overcome these challenges, the authors propose a dynamic resource scheduling framework comprising a periodic global resource planner and an elastic hybrid GPU pool that enables non-blocking worker reallocation between stages. Central to this framework is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes rather than predicted trajectory lengths. Evaluated on a 48-GPU A800 cluster, the system achieves up to 3.0× higher throughput and up to 2.5× faster reward convergence than the baselines.
📝 Abstract
Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that challenge conventional resource-management assumptions. Three fundamental challenges arise. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory-length distribution drifts over time, rendering any static resource split progressively suboptimal.
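To make the first challenge concrete, the sketch below computes how much slot-time sits idle when a batch only finishes once its longest trajectory finishes. The batch composition (95 short and 5 long trajectories), the one-slot-per-trajectory model, and the function name `rollout_idle_fraction` are illustrative assumptions, not figures or code from the paper.

```python
def rollout_idle_fraction(lengths):
    """Fraction of slot-time left idle in a synchronous rollout batch.

    Assumes (for illustration only) that every trajectory runs in its own
    slot, that cost is proportional to length, and that the batch ends when
    the longest trajectory ends.
    """
    makespan = max(lengths)          # the batch waits for the longest trajectory
    busy = sum(lengths)              # slot-time actually spent generating
    return 1.0 - busy / (makespan * len(lengths))


# Hypothetical long-tailed batch: 95 short trajectories, 5 very long ones.
lengths = [1_000] * 95 + [16_000] * 5
idle = rollout_idle_fraction(lengths)
print(f"idle fraction: {idle:.3f}")
```

In this toy batch, 5% of the trajectories set the makespan while most slots wait, which is the effect the abstract attributes to the long tail.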
We present Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0× higher throughput and reaches reward convergence up to 2.5× faster than the baselines.
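The general idea of a feedback queue driven by tool-return outcomes can be sketched as follows. This is a minimal illustration of a multi-level feedback queue, not Libra's C-MLFQ: the class names, the three-level layout, the outcome labels `"success"` and `"error"`, and the promote/demote rule are all hypothetical, since the abstract does not specify them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    level: int = 0  # 0 is the highest-priority bucket (assumed convention)


class FeedbackQueues:
    """Toy multi-level feedback queue keyed on tool-return outcomes."""

    def __init__(self, num_levels=3):
        self.queues = [deque() for _ in range(num_levels)]

    def submit(self, req):
        self.queues[req.level].append(req)

    def next_request(self):
        # Serve the highest-priority non-empty bucket first.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

    def on_tool_return(self, req, outcome):
        # Hypothetical rule: a failed tool call suggests more turns ahead,
        # so demote; a successful one suggests progress, so promote.
        if outcome == "error":
            req.level = min(req.level + 1, len(self.queues) - 1)
        elif outcome == "success":
            req.level = max(req.level - 1, 0)
        self.submit(req)


sched = FeedbackQueues()
a, b = Request("a"), Request("b")
sched.submit(a)
sched.submit(b)
first = sched.next_request()          # "a" is served first
sched.on_tool_return(first, "error")  # "a" is demoted to level 1
```

The point of the sketch is that the routing decision reacts to an observed outcome after each tool call, so no up-front estimate of trajectory length is needed.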