🤖 AI Summary
Existing benchmarks rarely separate what an agent Skill says from how it is organized, so the effect of structure on runtime behavior is hard to isolate. This work studies Progressive Disclosure, a way of organizing Skills in which a concise root file points agents to supporting resources on demand, and compares it with a normalized flat baseline. It introduces SkillJuror, a framework that evaluates Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure raises the number of distinct Skill resources touched per trajectory from 1.18 to 3.85 and effective uptake events from 1.33 to 3.92, and it yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%). The outcome gain is task-dependent: it is stronger when supporting resources guide implementation, checking, or repair, and weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines.
📝 Abstract
Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at https://github.com/zhiyuchen-ai/skill-juror.
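The headline figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python and is not taken from the SkillJuror repository: the function names are hypothetical, and reading "+4.1%" as a percentage-point gain in pass rate (17 extra passing trials divided by 410 matched trials) is an assumption based on the numbers in the abstract.

```python
def pass_rate_gain_points(extra_passes: int, matched_trials: int) -> float:
    """Gain in pass rate, in percentage points (hypothetical helper)."""
    if matched_trials <= 0:
        raise ValueError("matched_trials must be positive")
    return 100.0 * extra_passes / matched_trials


def fold_change(baseline: float, treatment: float) -> float:
    """Ratio of the treatment value to the baseline value (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return treatment / baseline


# Figures quoted in the abstract (flat baseline vs. Progressive Disclosure).
gain = pass_rate_gain_points(17, 410)
resources = fold_change(1.18, 3.85)  # distinct Skill resources per trajectory
uptake = fold_change(1.33, 3.92)     # effective uptake events per trajectory

print(f"pass-rate gain: {gain:.1f} percentage points")  # 4.1
print(f"resources touched: {resources:.2f}x")           # 3.26x
print(f"uptake events: {uptake:.2f}x")                  # 2.95x
```

Under this reading, the behavioral measures roughly triple while the pass rate moves by about four points, which matches the abstract's claim that runtime behavior changes before aggregate outcomes do.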