WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

📅 2025-10-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks lack fine-grained evaluation of multimodal AI agents on web-navigation subtasks such as date selection and scroll positioning. This paper introduces WARC-Bench, the first benchmark explicitly designed for GUI-level subtask evaluation. Built on Web ARChive (WARC) files, it provides a reproducible, sandboxed, dynamic web interaction environment that faithfully reconstructs real-world interface behavior. Methodologically, the authors initialize models via supervised fine-tuning (SFT) and apply reinforcement learning with verifiable rewards (RLVR) to mitigate data scarcity. Experiments show that the best-performing model achieves a 64.8% success rate; RLVR improves the SFT baseline from 48.8% to 52.8%, outperforming many frontier models. This work fills a critical gap in fine-grained web-interaction evaluation and establishes a new paradigm for assessing the controllable, low-level operational capabilities of multimodal agents.

๐Ÿ“ Abstract
Training web agents to navigate complex, real-world websites requires them to master *subtasks* - short-horizon interactions with individual UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Benchmark evaluates multimodal AI agents on web GUI subtasks
Training techniques improve agent performance on complex webpage interactions
Addresses capability gap in existing benchmarks for robust web navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Web ARChive files enable sandboxed webpage interactions
Supervised fine-tuning trains models on subtask execution
Reinforcement learning with verifiable rewards enhances performance
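The paper itself does not publish code for its WARC replay environment. As background, a WARC file is a sequence of records, each with a `WARC/1.0` version line, a header block, and a payload of `Content-Length` bytes. The sketch below is a minimal, hypothetical parser of that layout for illustration only; it assumes uncompressed records (real archives are usually gzipped per record, which tools such as `warcio` handle transparently) and is not the benchmark's actual harness.

```python
# Minimal sketch: parse uncompressed WARC/1.0 records from raw bytes.
# Each record: version line, "Name: value" headers, blank line,
# Content-Length bytes of payload, then a two-CRLF record separator.

def parse_warc_records(data: bytes):
    """Return a list of (headers, body) pairs for each record in `data`."""
    records = []
    pos = 0
    while pos < len(data):
        # The header block ends at the first blank line (CRLF CRLF).
        header_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        assert lines[0].startswith("WARC/"), "expected a WARC version line"
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        body_start = header_end + 4
        body_len = int(headers["Content-Length"])
        body = data[body_start:body_start + body_len]
        records.append((headers, body))
        # Skip the payload and the trailing record separator.
        pos = body_start + body_len + 4
    return records

# A tiny hand-written record for illustration:
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: response\r\n"
       b"WARC-Target-URI: http://example.com/\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello\r\n\r\n")

headers, body = parse_warc_records(raw)[0]
print(headers["WARC-Target-URI"], body)  # http://example.com/ b'hello'
```

A replay sandbox of the kind the paper describes would serve such archived responses back to the browser in place of live network traffic, making page behavior reproducible across runs.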
Sanjari Srivastava
Masters in Computer Science, Stanford University
Deep Learning · Natural Language Processing · Explainable AI
Gang Li
Uniphore
Cheng Chang
Uniphore
Rishu Garg
Uniphore
Manpreet Kaur
Uniphore
Charlene Y. Lee
Uniphore
Yuezhang Li
Uniphore
Yining Mao
Uniphore
Ignacio Cases
Uniphore
Yanan Xie
Orby AI
Natural Language Processing · Machine Learning · Service Computing
Peng Qi
Uniphore