AI Summary
High acquisition costs and poor data quality for Layer-2 (L2) blockchain data severely hinder data-driven research in emerging ecosystems such as ZKsync. To address this, we construct and open-source the first high-quality, structured dataset comprehensively capturing one year of on-chain activity on ZKsync Era, filling a critical gap in publicly available, high-fidelity L2 chain data. Leveraging an archival node, our pipeline employs batch synchronization, transaction decoding, state snapshot extraction, and schema normalization to produce a standardized, Parquet-formatted dataset optimized for SQL querying. It comprises over 120 million transactions, tens of millions of addresses, and complete smart contract deployment records. We also release a fully reproducible data extraction workflow and analytical templates. This dataset has already enabled cutting-edge research in MEV modeling, gas optimization, and zk-SNARK verification pattern analysis.
Abstract
Although blockchain data is publicly available, practical challenges and high costs often hinder its effective use by researchers, limiting data-driven research and exploration in the blockchain space. This is especially true for Layer-2 (L2) ecosystems, and for ZKsync in particular. To address these issues, we have curated a dataset covering one year of activity, extracted from a ZKsync Era archive node, and made it freely available to external parties. We provide details on this dataset and how it was created, showcase a few example analyses that can be performed with it, and discuss some future research directions.