🤖 AI Summary
Existing open-source large multimodal models (LMMs) significantly underperform humans on online web tasks, primarily due to the lack of large-scale, cross-domain, trajectory-level multimodal training data. Method: We propose an exploration-driven framework for autonomous web-interaction trajectory generation that pairs LLMs with vision models for collaborative exploration, supporting dynamic page rendering, interaction recording, multi-stage filtering, semantic refinement, and multimodal encoding with alignment-based training. Contribution/Results: Our framework constructs the largest open-source multimodal web-trajectory dataset to date at low cost ($0.28 per trajectory): 94K successful trajectories spanning 49K unique URLs and 720K screenshots. Experiments confirm data scale as a critical driver of online task performance. The resulting Explorer agent achieves state-of-the-art results among open-source methods on benchmarks including Mind2Web-Live, improving online task success rate by 32%.
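The quoted figures make the overall budget easy to sanity-check; a quick back-of-the-envelope calculation using only the numbers stated above:

```python
# Figures quoted in the summary above; nothing here is from the paper's
# appendix -- it is simple arithmetic on the reported headline numbers.
COST_PER_TRAJECTORY = 0.28   # USD per successful trajectory, as reported
NUM_TRAJECTORIES = 94_000    # successful trajectories
NUM_SCREENSHOTS = 720_000    # total screenshots in the dataset

total_cost = COST_PER_TRAJECTORY * NUM_TRAJECTORIES
shots_per_traj = NUM_SCREENSHOTS / NUM_TRAJECTORIES

print(f"Estimated total synthesis cost: ${total_cost:,.0f}")   # -> $26,320
print(f"Average screenshots per trajectory: ~{shots_per_traj:.1f}")  # -> ~7.7
```

At roughly $26K for the full dataset, the per-trajectory cost claim translates into a budget within reach of many academic groups, which is the accessibility point the summary is making.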
📝 Abstract
Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances on offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse, large-scale, trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories that span 49K unique URLs and include 720K screenshots and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making the recipe affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
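The abstract describes trajectory-level multimodal data pairing task intents with per-step screenshots and web elements, but does not publish a schema. A minimal sketch of what one such record might look like (all field and class names here are illustrative assumptions, not the dataset's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One interaction step: a screenshot paired with the grounded action.
    Field names are hypothetical, chosen only to illustrate the idea."""
    screenshot_path: str   # rendered page image captured at this step
    action: str            # e.g. "click", "type", "scroll"
    target_element: str    # illustrative element identifier
    value: str = ""        # text typed, if any

@dataclass
class Trajectory:
    """A full task episode: a refined intent plus the ordered steps."""
    intent: str            # natural-language task description
    start_url: str
    steps: list[Step] = field(default_factory=list)
    success: bool = False  # only successful episodes enter the dataset

# Toy example of one short, invented trajectory
traj = Trajectory(
    intent="Find the cheapest flight from NYC to SF",
    start_url="https://example.com",
    steps=[
        Step("step0.png", "click", "search-box"),
        Step("step1.png", "type", "search-box", "NYC to SF"),
    ],
    success=True,
)
print(len(traj.steps))  # -> 2
```

The point of the sketch is that each training example is an *episode*, not a single (screenshot, label) pair, which is why collecting such data at scale is expensive and why a synthesis recipe matters.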