Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

📅 2025-11-05
🤖 AI Summary
This paper addresses the “lead–lag forecasting” (LLF) problem in multivariate time series—predicting high-impact downstream events (e.g., citations, forks, sales) from earlier user behaviors (e.g., views, likes). To overcome the longstanding absence of standardized benchmarks, we introduce the first systematic LLF framework and release two large-scale, cross-platform, long-horizon benchmark datasets—free of survivorship bias—comprising 2.3 million scholarly articles and 3 million open-source repositories. We formally define LLF as a unified predictive paradigm grounded in temporal causality. Statistical tests empirically validate the existence and strength of lead–lag relationships across domains. Furthermore, we implement and evaluate both parametric and nonparametric regression baselines to establish performance benchmarks. Our work fills critical gaps in time-series forecasting research by providing the first dedicated data infrastructure and methodological foundation for proactive prediction in socio-technical collaboration systems.

📝 Abstract
Social and collaborative platforms emit multivariate time-series traces in which early interactions, such as views, likes, or downloads, are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets, arXiv (accesses → citations for 2.3M papers) and GitHub (pushes/stars → forks for 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn (profile views → messages). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We document all technical details of data curation and cleaning, verify the presence of lead-lag dynamics through statistical and classification tests, and benchmark parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.
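The LLF setup the abstract describes, predicting a later outcome channel from an early usage window, can be illustrated with a minimal regression baseline. This is a generic sketch on synthetic data, not the paper's actual datasets or baselines; the feature construction and outcome model here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for an LLF benchmark: each item has an early "lead" window
# (e.g. daily accesses in year 1) and a scalar "lag" outcome (e.g. citations later).
n_items, lead_len = 1000, 30
leads = rng.poisson(lam=5.0, size=(n_items, lead_len)).astype(float)
# Assumed generative model: outcome depends on total and recent lead activity, plus noise.
outcomes = (0.5 * leads.sum(axis=1)
            + 2.0 * leads[:, -5:].mean(axis=1)
            + rng.normal(0.0, 1.0, n_items))

# Parametric baseline: ordinary least squares on the raw lead window.
X = np.hstack([leads, np.ones((n_items, 1))])  # append an intercept column
train, test = slice(0, 800), slice(800, None)
coef, *_ = np.linalg.lstsq(X[train], outcomes[train], rcond=None)
pred = X[test] @ coef

# Out-of-sample R^2 of the linear baseline.
ss_res = np.sum((outcomes[test] - pred) ** 2)
ss_tot = np.sum((outcomes[test] - outcomes[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))
```

Because the synthetic outcome is linear in the lead window, OLS recovers it almost exactly; on the real arXiv and GitHub data the paper additionally evaluates non-parametric baselines.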
Problem

Research questions and friction points this paper is trying to address.

Formalizing Lead-Lag Forecasting to predict temporally shifted outcome channels from early usage patterns
Addressing the absence of standardized datasets for lead-lag forecasting research in time-series analysis
Providing benchmark datasets capturing long-horizon dynamics across social platforms without survivorship bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces benchmark datasets for lead-lag forecasting
Formalizes Lead-Lag Forecasting as unified time-series problem
Provides statistical verification and baseline models for validation
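The kind of lead-lag relationship the paper verifies statistically can be illustrated with a simple lagged cross-correlation scan: correlate the lead channel against the outcome channel at increasing time offsets and look for a peak at a positive lag. This is a generic sketch on synthetic data (the shift of 5 steps and noise level are assumptions), not the paper's specific test.

```python
import numpy as np

def lagged_correlation(lead, outcome, max_lag):
    """Pearson correlation between lead[t] and outcome[t + k] for k = 0..max_lag."""
    corrs = []
    for k in range(max_lag + 1):
        x = lead[: len(lead) - k] if k else lead
        y = outcome[k:]
        corrs.append(np.corrcoef(x, y)[0, 1])
    return np.array(corrs)

# Synthetic example: the outcome channel is a noisy copy of the lead, shifted by 5 steps.
rng = np.random.default_rng(0)
lead = rng.standard_normal(500)
shift = 5
outcome = np.concatenate([np.zeros(shift), lead[:-shift]]) + 0.1 * rng.standard_normal(500)

corrs = lagged_correlation(lead, outcome, max_lag=20)
best_lag = int(np.argmax(corrs))
print(best_lag)  # the peak correlation recovers the true 5-step shift
```

A pronounced peak at a positive lag, as here, is evidence that the lead channel temporally precedes the outcome channel, which is the premise LLF builds on.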
👥 Authors
Kimia Kazemian
Department of Computer Science, Cornell University
Zhenzhen Liu
Department of Computer Science, Cornell University
Yangfan Yang
Department of Information Science, Cornell University
Katie Z Luo
Cornell University
Shuhan Gu
Department of Computer Science, Cornell University
Audrey Du
Department of Computer Science, Cornell University
Xinyu Yang
Department of Information Science, Cornell University
Jack Jansons
Department of Computer Science, Cornell University
Kilian Q. Weinberger
Department of Computer Science, Cornell University
John Thickstun
Assistant Professor, Cornell University
Yian Yin
Department of Information Science, Cornell University
Sarah Dean
Department of Computer Science, Cornell University