🤖 AI Summary
The scarcity of large-scale, real-world human web-interaction data with high-quality annotations has severely hindered reproducible research on web-based intelligent agents. To address this gap, this work introduces a large open-source dataset of 31,725 trajectories (318,000 steps), built on a data paradigm that aligns visual, structural, and action modalities. A scalable human-trajectory collection pipeline ensures coverage of high-value, complex web tasks. The work further proposes a dual mid-training strategy that decouples spatial grounding from task planning, achieving state-of-the-art performance on the newly curated WebChainBench as well as multiple public GUI benchmarks. This approach significantly improves generalization in real-world web environments, providing critical data and methodological foundations for the next generation of scalable web agents.
📝 Abstract
We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
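To make the Triple Alignment concrete, the sketch below models one plausible shape for a triple-aligned trajectory record, where each step pairs a screenshot (visual), a DOM snapshot (structural), and an executed action. All field and class names here are illustrative assumptions, not the actual WebChain schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema sketch; names are assumptions, not the WebChain release format.
@dataclass
class Step:
    screenshot: str   # visual modality: path to the rendered page screenshot
    dom_snapshot: str # structural modality: serialized DOM/HTML at this step
    action: dict      # action modality: e.g. {"type": "click", "target": "#submit"}

@dataclass
class Trajectory:
    task: str                              # natural-language task description
    steps: List[Step] = field(default_factory=list)

# Example: a two-step trajectory for a form-submission task.
traj = Trajectory(task="Subscribe to the newsletter")
traj.steps.append(Step("step0.png", "<html>...</html>",
                       {"type": "type", "target": "#email", "text": "a@b.com"}))
traj.steps.append(Step("step1.png", "<html>...</html>",
                       {"type": "click", "target": "#subscribe"}))
print(len(traj.steps))  # 2
```

Aligning all three modalities per step lets a single trajectory supervise both grounding (mapping instructions to on-screen targets) and planning (choosing the next action), which is what the dual mid-training recipe decouples.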