Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

📅 2026-06-09
🤖 AI Summary
As agent skill repositories grow, the lack of systematic evaluation makes it hard to guarantee skill utility, quality, and safety. This survey introduces a unified framework for skill evaluation and evolution, categorizing skill evolution into four paradigms: execution feedback, trajectory distillation, compression, and reinforcement learning. It analyzes six skill-centric benchmark categories and identifies structural gaps in benchmark coverage and metric richness. By framing the shift from isolated skill creation to automated, evaluation-driven evolution, the survey outlines open directions toward generalizable, efficient, and verifiably safe skill ecosystems.
📝 Abstract
The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-world applications. Consequently, the field is undergoing a paradigm shift from isolated skill creation to automated, evaluation-driven skill evolution. In this survey, we systematically examine the landscape of skill evolution and evaluation beyond foundational skill creation. We categorize evolution into four distinct paradigms (execution feedback, trajectory distillation, compression, and reinforcement learning) and show how each paradigm contributes to improving skill utility and reliability. We also analyze six skill-centric benchmark categories, identifying structural gaps in benchmark coverage, trade-offs, and metric richness to advance skill research. Finally, we identify open directions for building skill ecosystems that are generalizable, efficient, and verifiably safe. The project URL is https://github.com/Cassie07/AgentSkill_Survey
Problem

Research questions and friction points this paper is trying to address.

agent skill evaluation
skill evolution
benchmarking
agentic systems
skill safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

skill evolution
agent evaluation
trajectory distillation
benchmark analysis
evaluation-driven learning