Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

📅 2025-11-05
🤖 AI Summary
This paper addresses the “lead–lag forecasting” (LLF) problem in multivariate time series—predicting high-impact downstream events (e.g., citations, forks, sales) from earlier user behaviors (e.g., views, likes). To overcome the longstanding absence of standardized benchmarks, we introduce the first systematic LLF framework and release two large-scale, cross-platform, long-horizon benchmark datasets—free of survivorship bias—comprising 2.3 million scholarly articles and 3 million open-source repositories. We formally define LLF as a unified predictive paradigm grounded in temporal causality. Statistical tests empirically validate the existence and strength of lead–lag relationships across domains. Furthermore, we implement and evaluate both parametric and nonparametric regression baselines to establish performance benchmarks. Our work fills critical gaps in time-series forecasting research by providing the first dedicated data infrastructure and methodological foundation for proactive prediction in socio-technical collaboration systems.

📝 Abstract
Social and collaborative platforms emit multivariate time-series traces in which early interactions, such as views, likes, or downloads, are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets, arXiv (accesses → citations for 2.3M papers) and GitHub (pushes/stars → forks for 3M repositories), and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn (profile views → messages). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We document all technical details of data curation and cleaning, verify the presence of lead-lag dynamics through statistical and classification tests, and benchmark parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.
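The LLF setup the abstract describes, predicting a later outcome channel from an early usage window, can be illustrated with a minimal regression baseline. This is a generic sketch on synthetic data, not the paper's actual datasets or baselines; the feature construction and outcome model here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for an LLF benchmark: each item has an early "lead" window
# (e.g. daily accesses in year 1) and a scalar "lag" outcome (e.g. citations later).
n_items, lead_len = 1000, 30
leads = rng.poisson(lam=5.0, size=(n_items, lead_len)).astype(float)
# Assumed generative model: outcome depends on total and recent lead activity, plus noise.
outcomes = (0.5 * leads.sum(axis=1)
            + 2.0 * leads[:, -5:].mean(axis=1)
            + rng.normal(0.0, 1.0, n_items))

# Parametric baseline: ordinary least squares on the raw lead window.
X = np.hstack([leads, np.ones((n_items, 1))])  # append an intercept column
train, test = slice(0, 800), slice(800, None)
coef, *_ = np.linalg.lstsq(X[train], outcomes[train], rcond=None)
pred = X[test] @ coef

# Out-of-sample R^2 of the linear baseline.
ss_res = np.sum((outcomes[test] - pred) ** 2)
ss_tot = np.sum((outcomes[test] - outcomes[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))
```

Because the synthetic outcome is linear in the lead window, OLS recovers it almost exactly; on the real arXiv and GitHub data the paper additionally evaluates non-parametric baselines.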
Problem

Research questions and friction points this paper is trying to address.

Formalizing Lead-Lag Forecasting to predict temporally shifted outcome channels from early usage patterns
Addressing the absence of standardized datasets for lead-lag forecasting research in time-series analysis
Providing benchmark datasets capturing long-horizon dynamics across social platforms without survivorship bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces benchmark datasets for lead-lag forecasting
Formalizes Lead-Lag Forecasting as unified time-series problem
Provides statistical verification and baseline models for validation
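The kind of lead-lag relationship the paper verifies statistically can be illustrated with a simple lagged cross-correlation scan: correlate the lead channel against the outcome channel at increasing time offsets and look for a peak at a positive lag. This is a generic sketch on synthetic data (the shift of 5 steps and noise level are assumptions), not the paper's specific test.

```python
import numpy as np

def lagged_correlation(lead, outcome, max_lag):
    """Pearson correlation between lead[t] and outcome[t + k] for k = 0..max_lag."""
    corrs = []
    for k in range(max_lag + 1):
        x = lead[: len(lead) - k] if k else lead
        y = outcome[k:]
        corrs.append(np.corrcoef(x, y)[0, 1])
    return np.array(corrs)

# Synthetic example: the outcome channel is a noisy copy of the lead, shifted by 5 steps.
rng = np.random.default_rng(0)
lead = rng.standard_normal(500)
shift = 5
outcome = np.concatenate([np.zeros(shift), lead[:-shift]]) + 0.1 * rng.standard_normal(500)

corrs = lagged_correlation(lead, outcome, max_lag=20)
best_lag = int(np.argmax(corrs))
print(best_lag)  # the peak correlation recovers the true 5-step shift
```

A pronounced peak at a positive lag, as here, is evidence that the lead channel temporally precedes the outcome channel, which is the premise LLF builds on.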
👥 Authors
Kimia Kazemian
Department of Computer Science, Cornell University
Zhenzhen Liu
Department of Computer Science, Cornell University
Yangfan Yang
Department of Information Science, Cornell University
Katie Z Luo
Cornell University
Shuhan Gu
Department of Computer Science, Cornell University
Audrey Du
Department of Computer Science, Cornell University
Xinyu Yang
Department of Information Science, Cornell University
Jack Jansons
Department of Computer Science, Cornell University
Kilian Q. Weinberger
Department of Computer Science, Cornell University
John Thickstun
Assistant Professor, Cornell University
Yian Yin
Department of Information Science, Cornell University
Sarah Dean
Department of Computer Science, Cornell University