🤖 AI Summary
This work addresses the cold-start problem for agent skills: when no execution trajectories have accumulated yet, an initial skill must be either authored by experts or generated in one shot by a large language model (LLM), and in both cases it is often behaviorally weak. To overcome this limitation, we propose SkillRevise, a novel framework that, for the first time, enables diagnosis and repair of skill deficiencies based on execution trajectories. SkillRevise analyzes these trajectories to identify failure points, retrieves relevant repair principles from a general memory bank, applies execution-anchored edits, and re-executes each candidate to measure its empirical utility, which closes the optimization loop. This approach refines initial skills autonomously, without expert intervention, and the revised skills transfer across models. Evaluated on three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, raising the base agent's success rate on SkillsBench from 36.05% to 61.63%.
📝 Abstract
Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates and measuring empirical utility, it retains the best-performing skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross-model transferability, capturing generalized procedural knowledge rather than model-specific artifacts.
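The diagnose, retrieve, edit, re-execute, and retain loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Trajectory`, `revise_skill`, `utility`, and the toy stub functions) is hypothetical, a skill is modeled as a plain string, and the LLM agent, diagnoser, memory retriever, and editor are replaced by simple callables so that the control flow is runnable. Utility is assumed to be the plain success rate over a fixed task set.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Trajectory:
    task_id: str
    success: bool
    log: str  # execution evidence that the diagnosis step reads


# Hypothetical component signatures; in practice each would wrap an LLM or a retriever.
Executor = Callable[[str, str], Trajectory]         # (skill, task_id) -> trajectory
Diagnoser = Callable[[str, List[Trajectory]], str]  # (skill, failures) -> defect
Retriever = Callable[[str], List[str]]              # defect -> repair principles
Editor = Callable[[str, str, List[str]], str]       # (skill, defect, principles) -> candidate


def utility(skill: str, tasks: Sequence[str], execute: Executor) -> Tuple[float, List[Trajectory]]:
    """Assumed utility: the empirical success rate of the skill on the task set."""
    if not tasks:
        raise ValueError("at least one task is required")
    runs = [execute(skill, t) for t in tasks]
    return sum(r.success for r in runs) / len(runs), runs


def revise_skill(initial_skill: str, tasks: Sequence[str], execute: Executor,
                 diagnose: Diagnoser, retrieve: Retriever, edit: Editor,
                 max_rounds: int = 3) -> Tuple[str, float]:
    best_skill = initial_skill
    best_score, runs = utility(best_skill, tasks, execute)
    for _ in range(max_rounds):
        failures = [r for r in runs if not r.success]
        if not failures:
            break  # nothing left to repair
        defect = diagnose(best_skill, failures)          # diagnose from execution evidence
        principles = retrieve(defect)                    # look up repair principles
        candidate = edit(best_skill, defect, principles) # apply an edit tied to the defect
        score, cand_runs = utility(candidate, tasks, execute)  # re-execute the candidate
        if score > best_score:                           # keep the candidate only if it helps
            best_skill, best_score, runs = candidate, score, cand_runs
    return best_skill, best_score


# Toy stubs (purely illustrative): a task succeeds if the skill text names its required step.
REQUIRED_STEP = {"t1": "run tests", "t2": "verify output"}


def toy_execute(skill: str, task_id: str) -> Trajectory:
    step = REQUIRED_STEP[task_id]
    ok = step in skill
    return Trajectory(task_id, ok, "" if ok else f"missing step: {step}")


def toy_diagnose(skill: str, failures: List[Trajectory]) -> str:
    return failures[0].log


def toy_retrieve(defect: str) -> List[str]:
    return ["add an explicit step for whatever the log reports as missing"]


def toy_edit(skill: str, defect: str, principles: List[str]) -> str:
    return skill + "; " + defect.split(": ", 1)[1]


skill, score = revise_skill("run tests", ["t1", "t2"], toy_execute,
                            toy_diagnose, toy_retrieve, toy_edit)
```

With these toy stubs, the initial skill passes one of two tasks, and a single revision round appends the missing step, so `skill` ends as `"run tests; verify output"` with a score of `1.0`. The comparison `score > best_score` is the part that corresponds to retaining a revision only when re-execution shows it to be better; a candidate that scores no higher than the current skill is discarded.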