WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

📅 2025-04-17

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Prior web agent research has focused mainly on navigation and transaction tasks and has largely neglected large-scale structured data extraction from complex, interactive websites. The paper introduces WebLists, a benchmark of 200 real-world data-extraction tasks across four business and enterprise use cases; each task requires an agent to navigate to a webpage, configure it appropriately, and extract a complete dataset with a well-defined schema. LLMs with search capabilities and state-of-the-art (SOTA) web agents reach only 3% and 31% recall, respectively. To close this gap, the authors propose BardeenAgent, a framework that converts an agent's execution into a repeatable program and replays it across pages with similar structure. It is described as the first LLM agent to exploit the regular structure of HTML: it constructs a generalizable CSS selector that captures all relevant items on a page, then fits the operations that extract the data. On WebLists, BardeenAgent achieves 66% recall overall, more than double that of SOTA web agents, and reduces cost per output row by 3x.

📝 Abstract

Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
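
To make the "generalizable CSS selector" idea concrete, here is a minimal Python sketch. It is not BardeenAgent's actual algorithm, which the abstract does not spell out. The function `generalize_selector` and the path representation are hypothetical: each example item is described by its root-to-element path of (tag, classes) pairs, and the sketch keeps only what all examples share at each level.

```python
from typing import FrozenSet, List, Tuple

# One ancestor level of an element: (tag name, set of class names).
Step = Tuple[str, FrozenSet[str]]


def generalize_selector(paths: List[List[Step]]) -> str:
    """Merge the paths of a few example items into one CSS selector.

    Assumption (illustrative only): all example paths have the same depth.
    At each level the tag is kept only if every example agrees, and only
    the classes shared by every example are kept.
    """
    if not paths or len({len(p) for p in paths}) != 1:
        raise ValueError("need at least one path, all of equal depth")
    parts = []
    for level in zip(*paths):
        tags = {tag for tag, _ in level}
        tag = tags.pop() if len(tags) == 1 else "*"
        shared = frozenset.intersection(*(classes for _, classes in level))
        parts.append(tag + "".join("." + c for c in sorted(shared)))
    return " > ".join(parts)


# Two example list items; only one carries the extra "featured" class.
examples = [
    [("ul", frozenset({"results"})), ("li", frozenset({"card", "featured"}))],
    [("ul", frozenset({"results"})), ("li", frozenset({"card"}))],
]
selector = generalize_selector(examples)
print(selector)  # ul.results > li.card
```

Dropping the class that only some examples carry is what lets the selector match every item in the list rather than just the ones the agent happened to inspect.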

Problem

Research questions and friction points this paper is trying to address.

Extracting structured data from complex interactive websites

Improving recall and efficiency in web data extraction tasks

Enabling scalable execution of web agents across similar pages
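
The recall figures quoted above measure how much of the expected dataset an agent actually returns. The summary does not state the benchmark's exact matching rule, so the sketch below assumes a simple one: a row counts as recovered if it matches an expected row exactly after whitespace and case normalization. The function name `row_recall` is hypothetical.

```python
from typing import Iterable, Sequence


def _normalize(row: Sequence[object]) -> tuple:
    """Normalize a row so trivial formatting differences do not count as misses."""
    return tuple(str(value).strip().lower() for value in row)


def row_recall(extracted: Iterable[Sequence[object]],
               expected: Iterable[Sequence[object]]) -> float:
    """Fraction of expected rows that appear among the extracted rows.

    Assumption: exact match after normalization; the real benchmark may
    use a different or more lenient comparison.
    """
    found = {_normalize(row) for row in extracted}
    wanted = [_normalize(row) for row in expected]
    if not wanted:
        return 1.0
    return sum(row in found for row in wanted) / len(wanted)


expected_rows = [("Acme", "Berlin"), ("Globex", "Paris"),
                 ("Initech", "Austin"), ("Umbrella", "Tokyo")]
extracted_rows = [("acme ", "Berlin"), ("Globex", "Paris"), ("Hooli", "Palo Alto")]
recall = row_recall(extracted_rows, expected_rows)
print(recall)  # 0.5
```

Extra rows such as the unmatched one here do not lower recall; they would only affect precision, which this sketch does not compute.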

Innovation

Methods, ideas, or system contributions that make the work stand out.

Executable LLM agents for structured data extraction

Repeatable programs for scalable web data extraction

Generalizable CSS selectors for HTML structure utilization
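
The "repeatable program" idea can be sketched as a small recorded object that is built once and then replayed on many pages without further LLM calls. This is an illustration under stated assumptions, not the paper's implementation: the class `ExtractionProgram` is hypothetical, the pages are well-formed markup strings, and the standard library's ElementTree path expressions stand in for CSS selectors.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractionProgram:
    """Hypothetical recorded program: one item locator plus per-field locators."""
    item_path: str               # locates every repeated item on a page
    field_paths: Dict[str, str]  # column name -> path relative to an item

    def run(self, page_source: str) -> List[Dict[str, str]]:
        """Replay the program on one page and return one row per item."""
        root = ET.fromstring(page_source)
        rows = []
        for item in root.findall(self.item_path):
            row = {}
            for name, path in self.field_paths.items():
                node = item.find(path)
                # A missing field becomes an empty string rather than an error.
                row[name] = (node.text or "").strip() if node is not None else ""
            rows.append(row)
        return rows


program = ExtractionProgram(
    item_path=".//li[@class='card']",
    field_paths={"name": "./span[@class='name']",
                 "price": "./span[@class='price']"},
)

# Two pages with the same structure; the second has an item without a price.
pages = [
    "<html><body><ul><li class='card'><span class='name'>A</span>"
    "<span class='price'>10</span></li></ul></body></html>",
    "<html><body><ul><li class='card'><span class='name'>B</span>"
    "<span class='price'>20</span></li>"
    "<li class='card'><span class='name'>C</span></li></ul></body></html>",
]
rows = [row for page in pages for row in program.run(page)]
print(rows)
```

Because the program is plain data plus a deterministic `run` method, replaying it on additional pages costs only parsing time, which is one plausible reading of how replay lowers the cost per output row.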
