SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to skill self-evolution in large language model agents typically learn from a single trajectory per task, merge candidate skill patches without validating them first, and load the entire skill library at inference, which hurts both efficiency and generalization. This work proposes SkillCAT, a training-free framework that splits skill evolution into three stages. Contrastive Causal Extraction (CCE) compares successful and failed trajectories from the same task to extract evidence that explains the outcome difference. Assessment-Augmented Evolution (AAE) replays each candidate skill patch on clones of its source task and keeps only patches that improve or preserve outcomes before hierarchical merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology so that inference loads only the task-relevant nodes. On benchmarks including SpreadsheetBench, WikiTableQuestions, and DocVQA, as well as cross-model and out-of-distribution tests, SkillCAT raises the average score over baselines by up to 40.40%.

📝 Abstract

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.
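The three stages described in the abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Trajectory` type, the function names, the per-task score lists, and the tag-overlap routing rule are all hypothetical, and the abstract does not specify how evidence is extracted, how outcomes are scored, or how the topology is built.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one sampled run of a task."""
    task_id: str
    steps: tuple[str, ...]
    success: bool


def contrastive_pairs(trajectories):
    """CCE-style pairing (assumed form): yield every (success, failure)
    pair drawn from the same task. Tasks whose sampled runs all succeed
    or all fail contribute no pairs."""
    by_task = {}
    for traj in trajectories:
        by_task.setdefault(traj.task_id, []).append(traj)
    for group in by_task.values():
        wins = [t for t in group if t.success]
        losses = [t for t in group if not t.success]
        yield from product(wins, losses)


def keep_patch(scores_before, scores_after):
    """AAE-style gate (assumed form): accept a candidate patch only if
    no replayed task clone scores lower with the patch than without it."""
    if len(scores_before) != len(scores_after):
        raise ValueError("score lists must cover the same task clones")
    return all(after >= before
               for before, after in zip(scores_before, scores_after))


def route(topology, task_tags):
    """TTE-style routing (assumed form): return only the sub-skill nodes
    whose tags overlap the task's tags, instead of the whole library."""
    return sorted(name for name, tags in topology.items() if tags & task_tags)
```

In this sketch, a patch that leaves every replayed score unchanged is still kept, which mirrors the abstract's "improve or preserve" wording. Whether the paper applies the check per task clone or on an aggregate score is not stated in the abstract, so the per-clone rule here is a guess.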

Problem

Research questions and friction points this paper is trying to address.

skill self-evolution

LLM agents

execution trajectories

skill generalization

inference efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Causal Extraction

Assessment-Augmented Evolution

Topology-Aware Task Execution