Detecting Pipeline Failures through Fine-Grained Analysis of Web Agents

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current web agent evaluations rely predominantly on end-to-end success rates. They overlook intermediate-step failures, which impedes precise failure diagnosis and systematic optimization. To address this, we propose a modular, fine-grained evaluation framework that decomposes agent execution into interpretable stages (comprehension, planning, action, and verification) and attaches structured diagnostic metrics to each stage, using the SeeAct framework and the Mind2Web dataset as a case study. The approach uncovers latent weaknesses that standard evaluations miss, such as DOM element localization drift and semantic misinterpretation of actions. Experiments indicate more accurate error attribution, which enables targeted debugging and iterative development. The framework offers a principled basis for designing robust, generalizable web agents through stage-wise analysis and improvement.
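To make the stage-wise idea concrete, here is a minimal sketch of attributing a failed task to the earliest stage that went wrong. The stage names come from the summary above; the `StageResult` record, the `first_failing_stage` helper, and the example trace are hypothetical illustrations, not the paper's actual implementation or data.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage names follow the summary; everything else here is an assumed layout.
STAGES = ("comprehension", "planning", "action", "verification")


@dataclass
class StageResult:
    stage: str        # one of STAGES
    passed: bool      # whether the stage-level check succeeded
    note: str = ""    # free-text diagnostic (hypothetical field)


def first_failing_stage(results: List[StageResult]) -> Optional[StageResult]:
    """Return the earliest failed stage in pipeline order, or None if all passed."""
    by_stage = {r.stage: r for r in results}
    for stage in STAGES:
        result = by_stage.get(stage)
        if result is not None and not result.passed:
            return result
    return None


# Invented trace for illustration: the action stage fails first, and the
# verification failure is treated as a downstream consequence.
trace = [
    StageResult("comprehension", True),
    StageResult("planning", True),
    StageResult("action", False, "clicked the wrong DOM element"),
    StageResult("verification", False, "final page did not match the goal"),
]

failure = first_failing_stage(trace)
print(failure.stage if failure else "success")
```

Attributing the failure to the earliest failing stage is one possible design choice; it assumes later failures are caused by earlier ones, which may not always hold.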

📝 Abstract
Web agents powered by large language models (LLMs) can autonomously perform complex, multi-step tasks in dynamic web environments. However, current evaluations focus mostly on overall task success and overlook intermediate errors, which limits insight into failure modes and hinders systematic improvement. This work analyzes existing benchmarks and highlights their lack of fine-grained diagnostic tools. To address this gap, we propose a modular evaluation framework that decomposes agent pipelines into interpretable stages for detailed error analysis. Using the SeeAct framework and the Mind2Web dataset as a case study, we show how this approach reveals actionable weaknesses missed by standard metrics, paving the way for more robust and generalizable web agents.
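The contrast between an overall success rate and a per-stage view can be sketched as follows. This is an assumed setup, not taken from the paper: each task outcome is either `None` (success) or a hypothetical label naming the stage blamed for the failure, and all values are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

# Invented outcomes: None means the task succeeded; a string names the
# stage blamed for the failure (labels are hypothetical).
outcomes: List[Optional[str]] = [
    "action", None, "planning", "action",
    None, "comprehension", "action", None,
]


def success_rate(outcomes: List[Optional[str]]) -> float:
    """End-to-end metric: fraction of tasks with no recorded failure."""
    return sum(o is None for o in outcomes) / len(outcomes)


def failure_breakdown(outcomes: List[Optional[str]]) -> Dict[str, float]:
    """Share of failed tasks attributed to each stage."""
    failures = [o for o in outcomes if o is not None]
    if not failures:
        return {}
    counts = Counter(failures)
    return {stage: n / len(failures) for stage, n in counts.items()}


print(success_rate(outcomes))
print(failure_breakdown(outcomes))
```

The single success rate says only that most tasks failed, while the breakdown points to the stage where failures concentrate.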
Problem

Research questions and friction points this paper is trying to address.

Web agents lack fine-grained diagnostic tools
Current evaluations overlook intermediate errors in agent pipelines
Systematic improvement requires detailed error analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular evaluation framework decomposes agent pipelines
Fine-grained diagnostic tools for detailed error analysis
Interpretable stages reveal actionable agent weaknesses