🤖 AI Summary
Problem: Prior web agent research has focused primarily on navigation and transaction tasks, with little attention to extracting structured data at scale from interactive websites. Method: We introduce WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract a complete dataset with a well-defined schema. LLMs with search capabilities and state-of-the-art (SOTA) web agents achieve only 3% and 31% recall, respectively. To address this, we propose BardeenAgent, a framework that converts an agent's execution into a repeatable program and replays it at scale across pages with similar structure. It is the first LLM agent to exploit the regular structure of HTML: it constructs a generalizable CSS selector that captures all relevant items on the page, then fits the operations needed to extract the data. Contribution/Results: On WebLists, BardeenAgent achieves 66% recall overall, more than double that of SOTA web agents, and reduces cost per output row by 3×.
📝 Abstract
Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents and reducing cost per output row by 3x.
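To make the selector idea concrete, here is a minimal sketch of generalizing a selector from two example items and replaying the same extraction on a second page. It is an illustration only, not BardeenAgent's implementation: the function names (`paths`, `generalize`, `to_css`, `select`, `extract`), the sample pages, and the rule "keep the shared tag path and intersect the class sets" are all assumptions made for this example. It uses Python's standard library and assumes well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages with the same list structure (assumed for illustration).
PAGE_1 = """<div class="page"><ul class="list">
<li class="row first"><span class="name">Acme</span><span class="price">10</span></li>
<li class="row"><span class="name">Globex</span><span class="price">12</span></li>
<li class="row last"><span class="name">Initech</span><span class="price">9</span></li>
</ul></div>"""

PAGE_2 = """<div class="page"><ul class="list">
<li class="row first"><span class="name">Umbrella</span><span class="price">7</span></li>
<li class="row last"><span class="name">Hooli</span><span class="price">15</span></li>
</ul></div>"""


def paths(root):
    """Yield (element, path) pairs; a path is a list of (tag, class set) from the root."""
    def walk(el, prefix):
        here = prefix + [(el.tag, frozenset(el.get("class", "").split()))]
        yield el, here
        for child in el:
            yield from walk(child, here)
    yield from walk(root, [])


def generalize(path_a, path_b):
    """Keep the shared tag path and only the classes present in both examples."""
    same_tags = len(path_a) == len(path_b) and all(
        a[0] == b[0] for a, b in zip(path_a, path_b))
    if not same_tags:
        raise ValueError("examples do not share a tag path")
    return [(tag, ca & cb) for (tag, ca), (_, cb) in zip(path_a, path_b)]


def to_css(pattern):
    """Render the pattern as a CSS-style string (for readability only)."""
    return " > ".join(
        tag + "".join("." + c for c in sorted(classes)) for tag, classes in pattern)


def select(root, pattern):
    """Return every element whose root-anchored path satisfies the pattern."""
    return [
        el for el, p in paths(root)
        if len(p) == len(pattern)
        and all(t == pt and pc <= c for (t, c), (pt, pc) in zip(p, pattern))
    ]


def extract(root, pattern):
    """Apply the same field lookups to every matched item."""
    return [
        {"name": item.find("span[@class='name']").text,
         "price": item.find("span[@class='price']").text}
        for item in select(root, pattern)
    ]


# Generalize from two example items on the first page (first and last rows).
root1 = ET.fromstring(PAGE_1)
li_paths = [p for el, p in paths(root1) if el.tag == "li"]
pattern = generalize(li_paths[0], li_paths[-1])
selector = to_css(pattern)
rows_1 = extract(root1, pattern)

# Replay the same pattern on a second page with similar structure.
rows_2 = extract(ET.fromstring(PAGE_2), pattern)
print(selector, len(rows_1), rows_2)
```

In this sketch the two examples differ only in the `first` and `last` classes, so intersecting their class sets drops both and leaves a pattern that also covers the middle row. A real agent would have to handle malformed HTML, dynamic content, and items that do not share an exact tag path, none of which this toy matcher attempts.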