SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reliability degradation that AI agents face when reusing workflows under environment drift, underspecified tasks, or shifting task distributions. The authors propose selective formalization and gate-conditioned execution: execution evidence determines whether each workflow step is realized as executable code or left as natural-language guidance. The approach combines validation gates, fallback paths, and multimodal evidence (outputs, screenshots, and error traces) in auditable, versioned notebooks. On WebArena-Verified, the method achieves 53.7% single-round success, retains 91.7% of initially successful tasks across three re-executions, and limits post-repair regressions to 4.2%. It also leads on the Mind2Web cross-website and cross-domain splits and preserves performance in a GitLab version-migration test.

📝 Abstract

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
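The gate-conditioned execution described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `run_step`, `click_merge`, `merge_button_present`, and `agent_fallback` are hypothetical, the dictionary-based state is an assumption, and the natural-language fallback is a stub standing in for an agent call. It only shows the control flow: a step runs its code when every gate validates, otherwise it falls back locally and records evidence either way.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One workflow step (hypothetical structure, not the paper's schema)."""
    name: str
    guidance: str                                   # natural-language guidance, always kept
    code: Optional[Callable[[dict], dict]] = None   # executable realization, if formalized
    gates: List[Callable[[dict], bool]] = field(default_factory=list)


def run_step(step, state, nl_fallback, evidence):
    """Run the step's code if all gates validate; otherwise fall back locally."""
    if step.code is not None and all(gate(state) for gate in step.gates):
        try:
            new_state = step.code(state)
            evidence.append({"step": step.name, "mode": "code", "ok": True})
            return new_state
        except Exception as exc:
            # Keep the error trace as evidence, then fall through to the fallback.
            evidence.append({"step": step.name, "mode": "code",
                             "ok": False, "error": repr(exc)})
    new_state = nl_fallback(step.guidance, state)
    evidence.append({"step": step.name, "mode": "fallback", "ok": True})
    return new_state


# --- Toy usage: all names and page contents below are illustrative assumptions ---
def click_merge(state):
    return {**state, "merged": True}


def merge_button_present(state):
    # Gate: the element the code depends on must still exist.
    return "merge-btn" in state.get("dom", "")


def agent_fallback(guidance, state):
    # Stand-in for a natural-language guided agent action.
    return {**state, "merged": True, "via": "agent"}


step = Step("merge", "Click the merge button on the merge request page.",
            click_merge, [merge_button_present])
log = []
fresh = run_step(step, {"dom": "<button id='merge-btn'>"}, agent_fallback, log)
drifted = run_step(step, {"dom": "<button id='accept-mr'>"}, agent_fallback, log)
modes = [entry["mode"] for entry in log]
```

In this toy run the first call passes its gate and executes the code path, while the second simulates drift (the button identifier changed), so only that step falls back and the rest of the workflow would be unaffected. The accumulated `log` plays the role of the execution evidence that a lifecycle policy could later use to decide whether a step stays formalized.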

Problem

Research questions and friction points this paper is trying to address.

lifecycle reliability

environment drift

workflow reuse

task distribution shift

web automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

selective formalization

gated execution

lifecycle governance