🤖 AI Summary
This work addresses emergent safety risks in large language model (LLM) agents, where individually safe skills can yield unsafe behavior when composed. The authors propose SkillReact, a measurement framework that evaluates compositional risk through deterministic static composition analysis, LLM-assisted two-rater human adjudication, and action-level exploitability testing with cross-model behavioral comparison. Of the 1,520 ClawHub skills studied, 651 pass individual inspection and form 211,575 pairs; the static benchmark flags 22.25% of these pairs, and a human-calibrated audit estimates that about 18.2% of flagged pair-pattern hits are real compositional risks, or roughly 14,000 genuine risk memberships. These findings expose the limits of single-skill safety assessment and motivate install-time compositional checks and capability isolation as complements to per-skill scanning. The experiments also show that models differ sharply in whether they execute the same risky skill composition: Haiku-4-5 issues the dropper-stage tool call, Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright.
📝 Abstract
LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.
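The pair count quoted in the abstract, and the idea of a pair that is risky only in combination, can be illustrated with a minimal sketch. The capability labels (`net.download`, `proc.execute`, `fs.read`), the `DROPPER` pattern, and the helper names below are hypothetical and chosen for illustration; they are not SkillReact's actual taxonomy or implementation. The roughly 14K figure counts pair-pattern memberships rather than pairs, so it is not re-derived here.

```python
from itertools import combinations
from math import comb

# Figures quoted in the abstract.
SAFE_SKILLS = 651      # skills that pass individual inspection
FLAG_RATE = 0.2225     # share of pairs flagged as structural candidates

total_pairs = comb(SAFE_SKILLS, 2)             # unordered pairs of safe skills
flagged_pairs = round(total_pairs * FLAG_RATE)  # approximate flagged-pair count

# Hypothetical pattern: a "dropper" needs both a download and an execute capability.
DROPPER = frozenset({"net.download", "proc.execute"})


def flags_pattern(caps_a, caps_b, pattern=DROPPER):
    """Return True if neither skill covers the pattern alone but their union does."""
    a, b = set(caps_a), set(caps_b)
    return not pattern <= a and not pattern <= b and pattern <= (a | b)


def scan(skills, pattern=DROPPER):
    """Return every unordered skill pair whose combined capabilities match the pattern."""
    return [
        (x, y)
        for x, y in combinations(sorted(skills), 2)
        if flags_pattern(skills[x], skills[y], pattern)
    ]


# Toy registry (assumed data): each skill is individually harmless under this pattern.
toy_registry = {
    "fetcher": {"net.download"},
    "runner": {"proc.execute"},
    "notes": {"fs.read"},
}

if __name__ == "__main__":
    print(total_pairs, flagged_pairs)
    print(scan(toy_registry))
```

In this toy registry only the `fetcher` and `runner` pair is flagged, which mirrors the abstract's point that a per-skill scanner cannot see such a risk because each skill passes on its own. A structural flag of this kind is a candidate only; per the abstract, whether it turns into an actual tool call depends on the host model.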