See and Remember: A Multimodal Agent for Web Traversal

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the spatial disorientation and cyclic path failures that commonly impede large language model agents in autonomous web navigation. To overcome these limitations, the authors propose V-GEMS, a multimodal agent architecture that integrates visual grounding with an explicit memory stack featuring state tracking. This design enables precise identification of interactive elements, long-term contextual awareness, and structured path backtracking. The study also introduces an updatable dynamic benchmark for evaluating navigation adaptability, on which vision-grounded resolution of ambiguous elements and avoidance of cyclic errors in web traversal can be assessed. Experimental results show that V-GEMS outperforms the WebWalker baseline by 28.7% on this benchmark, substantially improving the robustness and efficacy of complex web navigation.

๐Ÿ“ Abstract
Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show that V-GEMS significantly outperforms the WebWalker baseline, achieving a substantial 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V-GEMS.
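The explicit memory stack with state tracking described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: it assumes states are identified by URL, and the class and method names (`ExplicitMemoryStack`, `push`, `backtrack`) are invented for illustration. It shows how a visited-state set can block navigation loops while the stack preserves the traversal path for backtracking.

```python
# Hypothetical sketch of an explicit memory stack with state tracking.
# Assumes each navigation state is keyed by its URL; the paper's actual
# state representation and policies may differ.
class ExplicitMemoryStack:
    def __init__(self):
        self.stack = []       # ordered traversal path (structured map)
        self.visited = set()  # every state ever entered, for cycle detection

    def push(self, state):
        """Record a navigation step; refuse revisits to prevent loops."""
        if state in self.visited:
            return False      # cycle detected: the agent should backtrack
        self.stack.append(state)
        self.visited.add(state)
        return True

    def backtrack(self):
        """Pop the current state and return the previous one, if any."""
        if len(self.stack) > 1:
            self.stack.pop()
            return self.stack[-1]
        return None           # already at the root of the traversal
```

For example, after visiting page A and then page B, an attempt to navigate back to A is rejected (`push` returns `False`), and `backtrack()` deliberately returns the agent to A along the recorded path instead.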
Problem

Research questions and friction points this paper is trying to address.

web traversal
spatial disorientation
navigation loops
multimodal agent
long-term context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Grounding
Explicit Memory
Multimodal Agent
Web Traversal
State Tracking
🔎 Similar Papers
No similar papers found.
Xinjun Wang
Shanghai Institute of AI for Education, East China Normal University
Shengyao Wang
Huazhong Agricultural University; National Institute for Materials Science
CO2 reduction · N2 fixation · NO removal · CH4 activation
Aimin Zhou
Shanghai Institute of AI for Education, East China Normal University; Shanghai Innovation Institute
Hao Hao
East China Normal University