🤖 AI Summary
Existing approaches to agent skill evolution typically assume a fixed set of tools and evaluate each skill in isolation, so they struggle to repair tool-level failures and to account for interactions among skills. This work proposes SkillSmith, the first framework that co-evolves skills and tools in a unified proposal space, guiding the joint search with an ecological utility model inspired by Lotka-Volterra dynamics. SkillSmith supports bundled operations that wrap, compose, or split tools together with the skills that use them, estimates pairwise complementarity and conflict among skills from execution traces, and records anti-patterns (failure signatures, causal attributions, and remedies) to speed up diagnosis and veto proposals that repeat known mistakes. Evaluated on three benchmarks, including WildClawBench, and five scales of Qwen3.5 models, SkillSmith consistently outperforms strong baselines, with gains that grow on high-complexity tasks requiring multi-skill collaboration.
📝 Abstract
Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.
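To make the ecological utility model more concrete, the following is a minimal sketch of how a Lotka-Volterra-style pressure signal could be computed from execution traces. The abstract does not specify the estimator or the update rule, so everything here is an assumption for illustration: the function names (`estimate_interactions`, `pressure`, `step`), the success-rate-difference estimator for the interaction matrix, the Euler step, and the skill names are hypothetical and are not taken from SkillSmith itself.

```python
from itertools import permutations


def estimate_interactions(traces, skills):
    """Estimate a pairwise interaction matrix A from execution traces.

    Assumed estimator (not from the paper): A[i][j] is the success rate of
    episodes where skills i and j are both active, minus the success rate of
    episodes where i is active without j. Positive values suggest
    complementarity, negative values suggest conflict.
    """
    A = {i: {j: 0.0 for j in skills} for i in skills}
    for i, j in permutations(skills, 2):
        with_j = [ok for active, ok in traces if i in active and j in active]
        without_j = [ok for active, ok in traces if i in active and j not in active]
        if with_j and without_j:  # leave A[i][j] at 0.0 when there is no evidence
            A[i][j] = sum(with_j) / len(with_j) - sum(without_j) / len(without_j)
    return A


def pressure(utility, growth, A):
    """Generalized Lotka-Volterra per-capita rate: r_i + sum_j A_ij * u_j."""
    return {
        i: growth[i] + sum(A[i][j] * utility[j] for j in utility)
        for i in utility
    }


def step(utility, growth, A, dt=0.1):
    """One explicit Euler step of du_i/dt = u_i * (r_i + sum_j A_ij * u_j)."""
    p = pressure(utility, growth, A)
    return {i: max(0.0, utility[i] + dt * utility[i] * p[i]) for i in utility}


# Hypothetical traces: (set of co-activated skills, task succeeded?)
skills = ["parse", "plot", "scrape"]
traces = [
    ({"parse", "plot"}, True),
    ({"parse", "plot"}, True),
    ({"parse"}, False),
    ({"parse", "scrape"}, False),
    ({"scrape"}, True),
    ({"plot"}, True),
]

A = estimate_interactions(traces, skills)
utility = {s: 1.0 for s in skills}
growth = {s: 0.0 for s in skills}  # neutral intrinsic growth, an assumption
updated = step(utility, growth, A)

# Skills with the lowest pressure would be candidates for mutation or retirement.
for name, value in sorted(pressure(utility, growth, A).items(), key=lambda kv: kv[1]):
    print(f"{name}: pressure={value:+.2f}, utility={updated[name]:.3f}")
```

In this toy setup, a skill whose pressure is negative loses utility over successive steps, which is one plausible way such a signal could feed retrieval ranking, mutation prioritization, or retirement as described above.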