Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently harvesting large-scale, high-quality statistical datasets from unstructured websites. The authors propose a reinforcement learning–based focused crawling approach that formulates web page retrieval as a sleeping bandits problem, enabling dynamic learning of high-value hyperlink paths. Their proposed SB-CLASSIFIER leverages contextual features along navigation paths to intelligently identify hyperlinks leading to multi-target resource pages. Experimental evaluation on real-world websites encompassing millions of pages demonstrates that the method captures the vast majority of target resources while crawling only a small fraction of the site, significantly outperforming existing approaches in both efficiency and scalability.

📝 Abstract
Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets (resources of certain types) as possible from a given website, efficiently and scalably, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose a reinforcement learning approach based on sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages linking to many targets, based on the paths leading to those links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, retrieving a large fraction of a site's targets while crawling only a small part of the site.
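The core idea of the sleeping-bandits formulation can be illustrated with a small sketch: each kind of hyperlink is an "arm", but only the links present on the currently crawled page are "awake" at a given step, so the selector must choose among a changing subset of arms. The sketch below uses a standard UCB rule restricted to awake arms; the class name, the reward simulation, and all parameters are illustrative assumptions, not the paper's actual SB-CLASSIFIER.

```python
import math
import random
from collections import defaultdict

class SleepingBanditSelector:
    """Toy UCB-style selector where only a subset of arms is
    available ("awake") at each step. Illustrative only; not the
    paper's SB-CLASSIFIER."""

    def __init__(self):
        self.pulls = defaultdict(int)     # times each arm was chosen
        self.reward = defaultdict(float)  # cumulative reward per arm
        self.t = 0                        # global step counter

    def select(self, awake_arms):
        """Pick one arm among those currently available."""
        self.t += 1
        # Pull any awake arm we have never tried before.
        untried = [a for a in awake_arms if self.pulls[a] == 0]
        if untried:
            return untried[0]
        # Otherwise maximize an upper-confidence bound over awake arms.
        def ucb(a):
            mean = self.reward[a] / self.pulls[a]
            return mean + math.sqrt(2 * math.log(self.t) / self.pulls[a])
        return max(awake_arms, key=ucb)

    def update(self, arm, r):
        """Record the observed reward (e.g., targets found)."""
        self.pulls[arm] += 1
        self.reward[arm] += r

# Hypothetical simulation: "stats" links yield targets far more
# often than "news" or "about" links.
random.seed(0)
true_rate = {"stats": 0.8, "news": 0.2, "about": 0.0}
sel = SleepingBanditSelector()
for _ in range(500):
    # Not every link type appears on every page: arms "sleep".
    awake = [a for a in true_rate if random.random() < 0.7]
    if not awake:
        continue
    arm = sel.select(awake)
    sel.update(arm, 1.0 if random.random() < true_rate[arm] else 0.0)

print(dict(sel.pulls))  # "stats" should dominate after learning
```

In a crawler, the reward would be the number of target resources reached through the chosen link, and arms would be defined by contextual features of the link's navigation path rather than by fixed labels as in this toy setup.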
Problem

Research questions and friction points this paper is trying to address.

Web crawling
statistics datasets
data accessibility
scalable data acquisition
focused crawling
Innovation

Methods, ideas, or system contributions that make the work stand out.

focused crawling
reinforcement learning
sleeping bandits
scalable data acquisition
web statistics retrieval