RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

📅 2025-04-14
🤖 AI Summary
Existing benchmarks inadequately evaluate AI agents' capabilities in real-world, long-horizon web tasks—specifically their ability to interpret ambiguous instructions, adapt to evolving user mental states, and perform cross-site operations.

Method: We introduce the first multi-step, user-centric web assistance benchmark, constructed from continuously collected real-user instruction traces. It emphasizes intent inference, dynamic mental state tracking, and personalized action modeling—moving beyond conventional single-step, unambiguous task paradigms. Our approach integrates multi-turn dialogue modeling, vision-language joint grounding, user-state memory mechanisms, and cross-website action planning.

Results: Experiments reveal significant deficiencies in current SOTA models regarding long-horizon instruction understanding and GUI-action alignment, exposing core bottlenecks in practical web assistance. This work establishes a new evaluation standard and technical foundation for developing embodied, persona-aware AI agents capable of sustained, adaptive web interaction.

📝 Abstract
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents in long-horizon web-based tasks with real users
Addressing challenges in sequential, ambiguous, and evolving user instructions
Grounding user intent to GUI actions in realistic web scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for sequential real-world web tasks
Visual GUI grounding and ambiguous instruction handling
Tracks user mental state and specific routines
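To make the evaluation setup concrete, here is a minimal sketch of how a sequential instruction-following benchmark of this kind might score an agent. The `Step`, `EvalResult`, and `agent.act` interfaces below are illustrative assumptions, not the benchmark's actual API; the key idea it shows is that each instruction is judged in the context of the full interaction history, so the agent must carry user state across turns.

```python
# Hypothetical sketch of sequential instruction-following evaluation.
# Interfaces (Step, EvalResult, agent.act) are illustrative assumptions,
# not RealWebAssist's actual API.
from dataclasses import dataclass


@dataclass
class Step:
    instruction: str   # raw user utterance, possibly ambiguous
    gold_element: str  # GUI element the user actually intended


@dataclass
class EvalResult:
    correct: int = 0
    total: int = 0

    @property
    def step_accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def evaluate_session(agent, steps):
    """Replay one user's instruction sequence in order.

    The agent sees the accumulated history of (instruction, action)
    pairs, modeling the evolving user state emphasized by the benchmark.
    """
    result = EvalResult()
    history = []
    for step in steps:
        # Agent grounds the instruction to a GUI element, given prior turns.
        predicted = agent.act(step.instruction, history)
        result.total += 1
        if predicted == step.gold_element:
            result.correct += 1
        history.append((step.instruction, predicted))
    return result
```

Per-step accuracy is only one possible metric; a benchmark like this could also track whole-session success or grounding error types.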
Authors

Suyu Ye — Johns Hopkins University
Haojun Shi — Johns Hopkins University
Darren Shih — Johns Hopkins University
Hyokun Yun — Machine Learning Scientist at Amazon (Machine Learning, Statistics, Artificial Intelligence, Natural Language Processing, Optimization)
Tanya Roosta — Amazon.com
Tianmin Shu — Assistant Professor, JHU (Artificial Intelligence, Cognitive Science)