Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

📅 2026-06-12

🤖 AI Summary

This study examines whether modular components such as skill and memory modules still benefit online web agents when the total inference token budget is fixed. It evaluates three online augmentation methods (AWM, ASI, and ReasoningBank) against a token-matched vanilla baseline on three WebArena domains with Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, and on WorkArena-L1 with Qwen 3.6-27B. Under the matched budget, the module-free baseline matches or surpasses all three methods in aggregate success rate while often using fewer total tokens, although skills and workflow memory remain useful in specific domains. The study also shows that run-to-run variance materially affects outcomes and argues that it should be reported as a core evaluation criterion.

📝 Abstract

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.
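The abstract does not spell out how the token budget is matched or how variance is summarized, so the following is only a minimal sketch of one plausible accounting scheme. It assumes the budget is the augmented agent's mean total tokens per task (actor plus module tokens) and that variance is reported as the sample standard deviation of success rate across repeated runs. The `Episode` record and all function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Episode:
    """Token accounting for one task attempt (hypothetical record)."""
    actor_tokens: int    # tokens spent by the base actor
    module_tokens: int   # tokens spent by memory/skill modules (0 for vanilla)
    success: bool

    @property
    def total_tokens(self) -> int:
        return self.actor_tokens + self.module_tokens


def matched_budget(augmented: list[Episode]) -> float:
    """Mean total tokens per task of the augmented agent.

    Assumption: the vanilla baseline may spend up to this amount per task
    on additional actor steps instead of module calls.
    """
    return mean(e.total_tokens for e in augmented)


def success_rate(run: list[Episode]) -> float:
    """Fraction of successful episodes in a single run."""
    return sum(e.success for e in run) / len(run)


def summarize(runs: list[list[Episode]]) -> tuple[float, float]:
    """Mean and sample standard deviation of success rate across runs."""
    rates = [success_rate(r) for r in runs]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread
```

Counting module tokens in `total_tokens` is the point of the comparison: an augmented agent that looks better on actor tokens alone may not look better once its module overhead is charged against the same budget.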

Problem

Research questions and friction points this paper is trying to address.

online web agents

token budget

memory modules

skill modules

inference cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

budget-constrained evaluation

online web agents

token efficiency