🤖 AI Summary
This work addresses a key limitation of existing skill-augmented reinforcement learning methods: they do not evaluate the utility of candidate skills, so inefficient or detrimental skills often enter the skill library and hinder policy learning. To mitigate this, the authors propose an online reinforcement learning framework that estimates the marginal utility of a candidate skill before it is added to the skill library. It does so through matched-pair contrastive evaluation: base and skill-augmented rollouts are run under identical task and retrieval contexts, and the reward difference between the two groups quantifies the skill's contribution. The framework introduces, for the first time, a pre-storage validation mechanism that incurs no additional rollout overhead, uses the same marginal-utility signal to train the policy itself as a context-aware skill generator, and improves skill library quality through retrieval-time skill reranking and pruning of outdated skills. Experiments demonstrate substantial improvements in agent performance and stability on complex tasks, alongside reduced reliance on proprietary large language models.
📝 Abstract
Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused, implicitly assuming that generated skills are beneficial. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood also serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
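To make the matched-pair idea concrete, the following is a minimal sketch of how a reward gap between the two rollout groups might be computed and used as an acceptance test. The abstract states only that the reward gap estimates marginal utility. The use of a difference of group means, the function names `marginal_utility` and `should_store`, the `threshold` parameter, and the example rewards are all illustrative assumptions, not the authors' implementation.

```python
from statistics import mean


def marginal_utility(base_rewards, augmented_rewards):
    """Estimate a candidate skill's marginal utility for one task.

    Assumption: the reward gap is taken as the difference between the
    mean reward of the skill-augmented rollouts and the mean reward of
    the base rollouts collected under the same task and retrieval context.
    """
    if not base_rewards or not augmented_rewards:
        raise ValueError("both rollout groups must be non-empty")
    return mean(augmented_rewards) - mean(base_rewards)


def should_store(base_rewards, augmented_rewards, threshold=0.0):
    """Hypothetical pre-storage rule: keep the skill only if the gap exceeds threshold."""
    return marginal_utility(base_rewards, augmented_rewards) > threshold


# Made-up rewards for illustration: one rollout budget of 8 split into two groups of 4.
base = [0.0, 1.0, 0.0, 0.0]        # conditioned on the retrieved skills only
augmented = [1.0, 1.0, 0.0, 1.0]   # same skills plus one candidate skill

gap = marginal_utility(base, augmented)
keep = should_store(base, augmented)
```

Because both groups are drawn from the same rollout budget, a check of this kind adds no rollouts beyond those already needed for training, which matches the "no additional rollout overhead" property described above. How the budget is split and what threshold is used are left open here.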