Longitudinal Sampling of URLs From the Wayback Machine

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Web page longevity and archival sustainability remain poorly understood due to the lack of large-scale, temporally representative empirical studies. Method: Leveraging 26 years of Internet Archive data, we construct the first longitudinal web sample—comprising 27.3 million URLs—with strong temporal representativeness. We propose a multidimensional stratified sampling strategy based on first archival time, MIME type, URL depth, and top-level domain; integrate ZipNum indexing with the CDX API for efficient systematic sampling; and apply upsampling and downsampling to correct temporal distribution skew and domain-level bias. After filtering HTML pages and normalizing top-level URLs, we generate TimeMaps covering 3.8 billion archival records, enabling annual, million-scale balanced sampling. Contribution/Results: This work establishes the most comprehensive, reproducible empirical foundation to date for studying web persistence, enabling rigorous, longitudinal analysis of webpage survival and archival coverage.
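The upsampling step described above, deriving top-level URLs from deep links, can be sketched as a small Python helper (a minimal illustration; the function name and example URLs are ours, not taken from the paper):

```python
from urllib.parse import urlsplit

def top_level_url(url: str) -> str:
    """Reduce a deep link to its top-level (root) URL, e.g.
    http://example.com/a/b.html -> http://example.com/"""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

# Deep links archived in later years yield top-level URLs whose
# earlier first-capture dates can upsample the sparse early years.
deep_links = [
    "http://example.com/news/2003/story.html",
    "https://example.org/a/b/c?q=1",
]
roots = {top_level_url(u) for u in deep_links}
```

Deduplicating with a set matters here, since many deep links collapse to the same root URL.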

📝 Abstract
We document strategies and lessons learned from sampling the web by collecting 27.3 million URLs with 3.8 billion archived pages spanning 26 years (1996-2021) from the Internet Archive's (IA) Wayback Machine. Our goal is to revisit fundamental questions regarding the size, nature, and prevalence of the publicly archivable web, in particular, to reconsider the question: "How long does a web page last?" Addressing this question requires obtaining a sample of the web. We proposed several dimensions for sampling URLs from the Wayback Machine's holdings: time of first archive, HTML vs. other MIME types, URL depth (top-level pages vs. deep links), and top-level domain (TLD). We sampled 285 million URLs from IA's ZipNum index file, which contains every 6000th line of the CDX index. These indexes also include URLs of embedded resources, such as images, CSS, and JavaScript. To limit our sample to "web pages" (i.e., pages intended for human interaction), we filtered for likely HTML pages based on filename extension. We then queried IA's CDX API to determine the time of first capture and the MIME type of each URL. We grouped 92 million text/html URLs based on year of first capture. Archiving speed and capacity have increased over time, so we found more URLs archived in later years. To counter this, we extracted top-level URLs from deep links to upsample earlier years. Our target was 1 million URLs per year, but because the earliest years of the range are sparsely archived, we clustered them together, collecting 1.2 million URLs for that cluster. Popular domains such as Yahoo and Twitter were over-represented, so we performed logarithmic-scale downsampling. Our final dataset contains TimeMaps of 27.3 million URLs, comprising 3.8 billion archived pages. We convey lessons learned from sampling the archived web to inform future studies.
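The extension-based HTML filter described in the abstract might look like the sketch below. The exact set of "likely HTML" extensions is our assumption; the paper's list may differ, and extensionless or trailing-slash paths are kept here because they usually serve HTML:

```python
from urllib.parse import urlsplit
from posixpath import splitext

# Extensions treated as likely HTML pages (an illustrative list).
LIKELY_HTML_EXTS = {"", ".html", ".htm", ".php", ".asp", ".aspx", ".jsp"}

def is_likely_html(url: str) -> bool:
    """Heuristically classify a URL as a human-readable web page
    based only on its filename extension."""
    path = urlsplit(url).path
    if not path or path.endswith("/"):
        return True
    return splitext(path)[1].lower() in LIKELY_HTML_EXTS

urls = [
    "http://example.com/index.html",   # keep
    "http://example.com/logo.png",     # drop (embedded resource)
    "http://example.com/styles.css",   # drop (embedded resource)
    "http://example.com/about/",       # keep
]
pages = [u for u in urls if is_likely_html(u)]
```

This mirrors the goal stated in the abstract: excluding embedded resources such as images, CSS, and JavaScript before any CDX API queries are made.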
Problem

Research questions and friction points this paper is trying to address.

Estimate the longevity and prevalence of archivable web pages
Develop strategies for sampling URLs from the Wayback Machine
Balance temporal and domain representation in web archives
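The second point rests on IA's public CDX API. A hedged sketch of building and parsing a first-capture query follows (helper names are ours; only the documented `url`, `output`, `fl`, and `limit` parameters are used, and no request is actually sent here):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def first_capture_query(url: str) -> str:
    """Build a CDX API query for the earliest capture of `url`,
    requesting only the timestamp and MIME type fields."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,mimetype",
        "limit": "1",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_first_capture(rows):
    """Parse JSON rows from the CDX API: the first row is the field
    header; the second row (if any) is the earliest capture."""
    if len(rows) < 2:
        return None
    timestamp, mimetype = rows[1]
    return {"year": int(timestamp[:4]), "mimetype": mimetype}
```

The parsed year and MIME type are exactly the two attributes the paper needs to group URLs by year of first capture and to keep only text/html.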
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampled URLs from the Wayback Machine's ZipNum index
Filtered likely HTML pages by filename extension
Downsampled over-represented domains on a logarithmic scale
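The logarithmic-scale downsampling in the last bullet could be implemented along these lines (an illustrative cap function of our own devising, not the paper's exact formula):

```python
import math
import random

def downsample_by_domain(urls_by_domain, base=10, seed=42):
    """Cap each domain's contribution at a value that grows only
    logarithmically with its URL count, so very popular domains
    (e.g. Yahoo, Twitter) no longer dominate the sample."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept = {}
    for domain, urls in urls_by_domain.items():
        cap = max(1, math.ceil(math.log(len(urls) + 1, base)))
        kept[domain] = urls if len(urls) <= cap else rng.sample(urls, cap)
    return kept
```

With this sketch, a domain with 1000 URLs is reduced to a handful while a domain with one URL is untouched; any real study would tune the base and cap to hit its per-year sample targets.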