AI Summary
This work addresses a security gap in open agent skill ecosystems: malicious skills often hide behind benign descriptions, and existing defenses lack a unified benchmark that combines semantic analysis with runtime verification. To close this gap, the authors propose SkillVetBench, the first end-to-end security evaluation benchmark for agent skills. It operates in two stages. First, it applies natural-language semantic analysis to detect latent malicious intent. Second, it executes suspicious skills in an isolated sandbox, monitoring privileged primitives (e.g., exec, write_file) and inter-component interactions to produce auditable execution traces as forensic evidence. Experiments show that approaches relying solely on semantic or signature-based detection miss up to 89% of malicious skills, whereas SkillVetBench captures runtime attacks and backs its security verdicts with concrete, interpretable evidence.
Abstract
Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supply-chain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89% of malicious skills whose threats arise from natural-language instructions, multi-component logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write_file, install_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.
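The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Skill` class, the phrase list standing in for semantic vetting, and the pre-recorded `trace` standing in for sandbox output are all hypothetical. Only the four high-permission primitive names come from the abstract.

```python
from dataclasses import dataclass, field

# High-permission primitives named in the abstract.
HIGH_PERMISSION = {"exec", "write_file", "install_skill", "spawn"}

# Hypothetical stand-in for semantic vetting. A real system would
# analyze the specification with a language model, not a phrase list.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "send credentials",
    "disable logging",
)


@dataclass
class Skill:
    name: str
    description: str
    # Hypothetical: (primitive, argument) calls recorded by a sandbox run.
    trace: list = field(default_factory=list)


def semantic_flag(skill):
    """Stage 1: flag a skill whose description contains a suspicious phrase."""
    text = skill.description.lower()
    return any(phrase in text for phrase in SUSPICIOUS_PHRASES)


def runtime_evidence(skill):
    """Stage 2: keep only the recorded calls to high-permission primitives."""
    return [call for call in skill.trace if call[0] in HIGH_PERMISSION]


def vet(skill):
    """Run stage 2 only on skills that stage 1 flagged."""
    if not semantic_flag(skill):
        return {"skill": skill.name, "verdict": "pass", "evidence": []}
    evidence = runtime_evidence(skill)
    verdict = "malicious" if evidence else "unconfirmed"
    return {"skill": skill.name, "verdict": verdict, "evidence": evidence}
```

In this sketch a flagged skill is labeled malicious only when its trace contains at least one high-permission call, which mirrors the idea that execution traces, rather than the semantic flag alone, support the final verdict.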