Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

📅 2026-05-29

🤖 AI Summary
This study examines whether the availability and presentation granularity of skill documentation affect the task success of large language model (LLM) agents. Using a controlled design on a pinned, domain-balanced 30-task subset of the SkillsBench benchmark, the authors evaluate two reasoning-enabled model configurations (GPT-5.5 and DeepSeek V4-Flash) across six skill conditions, including a no-skill baseline and variants that differ in abstraction level and in whether a worked example is included, with five trials per task-condition-model cell. Uncertainty is quantified with 95% bootstrap confidence intervals, and mean-reward robustness checks support the same conclusion. Skill availability is the clearest signal: relative to no skill, skill conditions raise task-mean pass rate by 18.0 to 36.0 percentage points, depending on the model and condition. The presentation-granularity contrasts are small, uncertain, and model-dependent, and the confidence intervals for the low-versus-high abstraction contrast cross zero for both models. Within this subset, the results separate the effect of skill availability from that of presentation granularity and offer empirical guidance for designing agent skill representations.
📝 Abstract
Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.
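The analysis pipeline described in the abstract (aggregate five trials per cell, then estimate paired contrasts over tasks with a bootstrap interval) can be sketched as follows. This is a minimal illustration, not the authors' code: the row layout `(task_id, condition, passed)`, the function names, the percentile bootstrap method, the resample count, and the seed are all assumptions, since the abstract states only that 95% bootstrap confidence intervals were used.

```python
import random
from statistics import mean

# Design size as stated in the abstract.
N_TASKS, N_CONDITIONS, N_TRIALS, N_MODELS = 30, 6, 5, 2
ROWS_PER_MODEL = N_TASKS * N_CONDITIONS * N_TRIALS   # 900 rows per model
TOTAL_ROWS = ROWS_PER_MODEL * N_MODELS               # 1,800 rows in total


def task_pass_rates(rows, condition):
    """Aggregate trials into one pass rate per task for a single condition.

    `rows` is a hypothetical iterable of (task_id, condition, passed)
    tuples for one model; the real data schema is not given in the source.
    """
    by_task = {}
    for task_id, cond, passed in rows:
        if cond == condition:
            by_task.setdefault(task_id, []).append(1.0 if passed else 0.0)
    return {task: mean(trials) for task, trials in by_task.items()}


def paired_contrast(rows, cond_a, cond_b, n_boot=2000, seed=0):
    """Paired difference (cond_a minus cond_b) in percentage points.

    The task is the resampling unit. Returns (estimate, ci_low, ci_high)
    using a percentile bootstrap, which is an assumed choice of method.
    """
    a = task_pass_rates(rows, cond_a)
    b = task_pass_rates(rows, cond_b)
    tasks = sorted(set(a) & set(b))
    diffs = [a[t] - b[t] for t in tasks]

    rng = random.Random(seed)
    # Resample tasks with replacement and recompute the mean difference.
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot) - 1]
    return 100 * mean(diffs), 100 * low, 100 * high
```

Resampling tasks rather than individual trials matches the abstract's statement that the task is the inference unit: the five trials within a cell are not treated as independent observations of the contrast.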
Problem

Research questions and friction points this paper is trying to address.

skill availability
presentation granularity
large language model agents
task success
SkillsBench
Innovation

Methods, ideas, or system contributions that make the work stand out.

skill availability
presentation granularity
large language model agents
controlled benchmarking
SkillsBench